Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration

2024-06-10 Thread Peter Xu
On Mon, Jun 10, 2024 at 02:45:53PM -0300, Fabiano Rosas wrote:
> >> AIUI, the issue here is that users are already allowed to specify in
> >> libvirt the equivalent of direct-io and multifd independently of each
> >> other (bypass-cache, parallel). To start requiring both together now in
> >> some situations would be a regression. I confess I don't know the libvirt
> >> code well enough to know whether this can be worked around somehow, but
> >> as I said, it's a relatively simple change from the QEMU side.
> >
> > Firstly, I'd definitely like to avoid all the calls to either
> > migration_direct_io_start() or *_finish(); we already need to call them
> > explicitly in three paths, and that's neither intuitive nor readable,
> > just like the hard-coded rdma code.
> 
> Right, but that's just a side-effect of how the code is structured and
> the fact that writes to the stream happen in small chunks. Setting
> O_DIRECT needs to happen around aligned IO. We could move the calls
> further down into qemu_put_buffer_at(), but that would be four fcntl()
> calls for every page.

Hmm.. why would we need four fcntl()s instead of two?
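
To make the counting concrete, here is a minimal standalone sketch (the
helper name is made up, this is not QEMU code) of toggling the flag around
each aligned write: enabling O_DIRECT is an F_GETFL + F_SETFL pair, and so
is disabling it, hence four fcntl()s per page if done at that level.

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdbool.h>

    /* Hypothetical helper: flipping O_DIRECT is always a get+set pair. */
    static int set_o_direct(int fd, bool on)
    {
        int flags = fcntl(fd, F_GETFL);            /* call 1 (or 3) */
        if (flags < 0) {
            return -errno;
        }
        flags = on ? (flags | O_DIRECT) : (flags & ~O_DIRECT);
        return fcntl(fd, F_SETFL, flags) < 0 ? -errno : 0;  /* call 2 (or 4) */
    }

    /*
     * Per page: set_o_direct(fd, true), aligned pwrite(), then
     * set_o_direct(fd, false) before the following unaligned stream data,
     * i.e. four fcntl() syscalls for every page written.
     */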

> 
> A tangent:
>  one thing that occurred to me now is that we may be able to restrict
>  calls to qemu_fflush() to internal code like add_to_iovec() and maybe
>  use that function to gather the correct amount of data before writing,
>  making sure it disables O_DIRECT in case alignment is about to be
>  broken?

IIUC dio doesn't require alignment if we don't care about perf?  I mean, it
should be legal to write(fd, buffer, 5) even with O_DIRECT set?

I just noticed the asserts you added in the previous patch; I think that's
indeed better, but I'm still wondering whether we can avoid enabling it on
the QEMUFile.

Introducing dio at the QEMUFile level rather than in the iochannels makes me
slightly nervous - the API design of QEMUFile, with its default static
buffering, makes it easy to break things in a dio world.  And if we're going
to blacklist most of the API anyway, except the new one for mapped-ram, I
start to wonder why bother building on top of QEMUFile at all.

IIRC you also mentioned in the previous doc patch that libvirt should always
pass two fds into the fdset anyway if dio is enabled.  I wonder whether that
also holds for multifd=off and direct-io=on; if so, would it be possible to
use dio for the guest pages on one fd, while keeping the normal stream on
the other fd without dio?  I'm not sure whether it's easy to avoid QEMUFile
on the dio fd, but even if not, it looks like we could avoid the frequent
fcntl()s?
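
As a side note, a small standalone demo (assuming Linux and a filesystem
that accepts O_DIRECT; the path is arbitrary) of why the two fds need to be
opened separately rather than dup()ed: duplicated descriptors share the
file status flags, so O_DIRECT set on one is visible on the other.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int a = open("/tmp/dio-demo", O_WRONLY | O_CREAT, 0600);
        int b = dup(a);   /* shares the open file description with 'a' */

        /* Set O_DIRECT on 'a' only... */
        fcntl(a, F_SETFL, fcntl(a, F_GETFL) | O_DIRECT);

        /* ...and it shows up on 'b' too, so unaligned IO on 'b' would fail. */
        printf("O_DIRECT on b: %s\n",
               (fcntl(b, F_GETFL) & O_DIRECT) ? "yes" : "no");

        close(a);
        close(b);
        unlink("/tmp/dio-demo");
        return 0;
    }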

-- 
Peter Xu




Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API

2024-06-10 Thread Peter Xu
On Fri, Jun 07, 2024 at 08:49:01AM +, Gonglei (Arei) wrote:
> Actually we tried this solution, but it didn't work. Pls see patch 3/6
> 
> Known limitations:
>   For a blocking rsocket fd, if we use io_create_watch to wait for
>   POLLIN or POLLOUT events, since the rsocket fd is blocking, we
>   cannot determine when it is not ready to read/write as we can with
>   non-blocking fds. Therefore, when an event occurs, it keeps occurring,
>   potentially leaving qemu hanging. So we need to be cautious to avoid
>   hangs when using io_create_watch.

I'm not sure I fully get that part, though.  In:

https://lore.kernel.org/all/ZldY21xVExtiMddB@x1n/

I was thinking of the iochannel implementing its own poll with the _POLL
flag, in which case it would call qio_channel_poll(), which should call
rpoll() directly.  So I didn't expect qio_channel_create_watch() to be used.
I thought the context was that gmainloop won't work with rsocket fds in
general, but maybe I missed something.
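
To make that concrete, a rough sketch of what such a channel-level poll hook
could look like; note this is speculative on the QEMU side (the io_poll
callback and the QIOChannelRDMA field name are assumptions), only rpoll()
itself is the real rsocket API from <rdma/rsocket.h>:

    /*
     * Hypothetical ->io_poll implementation for a QIOChannelRDMA, bypassing
     * glib watches entirely: translate the GIOCondition into a pollfd and
     * let rsocket's rpoll() do the waiting.
     */
    static int qio_channel_rdma_poll(QIOChannel *ioc, GIOCondition cond,
                                     int timeout_ms)
    {
        QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
        struct pollfd pfd = {
            .fd = rioc->fd,
            .events = (cond & G_IO_IN ? POLLIN : 0) |
                      (cond & G_IO_OUT ? POLLOUT : 0),
        };
        int ret;

        do {
            ret = rpoll(&pfd, 1, timeout_ms);
        } while (ret < 0 && errno == EINTR);

        return ret;   /* >0 ready, 0 timeout, <0 error (errno set) */
    }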

Thanks,

-- 
Peter Xu




Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API

2024-06-10 Thread Peter Xu
On Fri, Jun 07, 2024 at 08:28:29AM +, Gonglei (Arei) wrote:
> 
> 
> > -Original Message-
> > From: Jinpu Wang [mailto:jinpu.w...@ionos.com]
> > Sent: Friday, June 7, 2024 1:54 PM
> > To: Gonglei (Arei) 
> > Cc: qemu-devel@nongnu.org; pet...@redhat.com; yu.zh...@ionos.com;
> > mgal...@akamai.com; elmar.ger...@ionos.com; zhengchuan
> > ; berra...@redhat.com; arm...@redhat.com;
> > lizhij...@fujitsu.com; pbonz...@redhat.com; m...@redhat.com; Xiexiangyou
> > ; linux-r...@vger.kernel.org; lixiao (H)
> > ; Wangjialin 
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> > 
> > Hi Gonglei, hi folks on the list,
> > 
> > On Tue, Jun 4, 2024 at 2:14 PM Gonglei  wrote:
> > >
> > > From: Jialin Wang 
> > >
> > > Hi,
> > >
> > > This patch series attempts to refactor RDMA live migration by
> > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > >
> > > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > > the detail of rdma protocol into rsocket and allows us to add support
> > > for some modern features like multifd more easily.
> > >
> > > Here is the previous discussion on refactoring RDMA live migration
> > > using the rsocket API:
> > >
> > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > > o.org/
> > >
> > > We have encountered some bugs when using rsocket and plan to submit
> > > them to the rdma-core community.
> > >
> > > In addition, the use of rsocket makes our programming more convenient,
> > > but it must be noted that this method introduces multiple memory
> > > copies, so some performance degradation can be expected; we hope that
> > > friends with RDMA network cards can help verify, thank you!
> > First, thanks for the effort. We are running migration tests on our IB
> > fabric with different generations of HCAs from Mellanox; the migration
> > works ok, there are a few failures, Yu will share the results later
> > separately.
> > 
> 
> Thank you so much. 
> 
> > The one blocker for the change is that the old implementation and the
> > new rsocket implementation don't talk to each other, due to the
> > different wire protocols used during connection establishment.
> > E.g. the old RDMA migration has special control messages during the
> > migration flow, while rsocket uses different control messages, so there
> > is no way to migrate a VM over the rdma transport from a pre-rsocket
> > QEMU to a new version with the rsocket implementation.
> > 
> > Probably we should keep both implementations for a while, mark the old
> > implementation as deprecated, promote the new implementation, and
> > highlight in the docs that they are not compatible.
> > 
> 
> IMO it makes sense. What's your opinion? @Peter.

Sounds good to me.  We can use an internal property field and enable rsocket
rdma migration on new machine types for the rdma protocol, deprecating both
the old rdma code and that internal field after 2 releases.  That way, when
receiving rdma migrations QEMU will use the old implementation (as an old
QEMU will use an old machine type), but when initiating an rdma migration a
new binary will switch to rsocket.
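
Roughly, the mechanics could follow how other migration defaults are tied to
machine types (sketch only; the property and array names below are made up):

    /*
     * Sketch: "x-rdma-use-rsocket" is a hypothetical migration property
     * defaulting to on.  Old machine types flip it back off through their
     * compat list, so migration on an old machine type keeps the legacy
     * RDMA code, while a new binary with a new machine type uses rsocket.
     */
    static GlobalProperty hw_compat_example[] = {
        { "migration", "x-rdma-use-rsocket", "off" },
    };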

It might be more important to address either the failures or perf concerns
that others raised, though.

Thanks,

-- 
Peter Xu




Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration

2024-06-10 Thread Peter Xu
On Fri, Jun 07, 2024 at 03:42:35PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Thu, May 23, 2024 at 04:05:48PM -0300, Fabiano Rosas wrote:
> >> We've recently added support for direct-io with multifd, which brings
> >> performance benefits, but creates a non-uniform user interface by
> >> coupling direct-io with the multifd capability. This means that users
> >> cannot keep the direct-io flag enabled while disabling multifd.
> >> 
> >> Libvirt in particular already has support for direct-io and parallel
> >> migration separately from each other, so it would be a regression to
> >> now require both options together. It's relatively simple for QEMU to
> >> add support for direct-io migration without multifd, so let's do this
> >> in order to keep both options decoupled.
> >> 
> >> We cannot simply enable the O_DIRECT flag, however, because not all IO
> >> performed by the migration thread satisfies the alignment requirements
> >> of O_DIRECT. There are many small read & writes that add headers and
> >> synchronization flags to the stream, which at the moment are required
> >> to always be present.
> >> 
> >> Fortunately, due to fixed-ram migration there is a discernible moment
> >> where only RAM pages are written to the migration file. Enable
> >> direct-io during that moment.
> >> 
> >> Signed-off-by: Fabiano Rosas 
> >
> > Is anyone going to consume this?  How's the performance?
> 
> I don't think we have a pre-determined consumer for this. This came up
> in an internal discussion about making the interface simpler for libvirt
> and in a thread on the libvirt mailing list[1] about using O_DIRECT to
> keep the snapshot data out of the caches to avoid impacting the rest of
> the system. (I could have described this better in the commit message,
> sorry).
> 
> Quoting Daniel:
> 
>   "Note the reason for using O_DIRECT is *not* to make saving / restoring
>the guest VM faster. Rather it is to ensure that saving/restoring a VM
>does not trash the host I/O / buffer cache, which will negatively impact
>performance of all the *other* concurrently running VMs."
> 
> 1- https://lore.kernel.org/r/87sez86ztq@suse.de
> 
> About performance, a quick test on a stopped 30G guest shows that
> mapped-ram=on direct-io=on is 12% slower than mapped-ram=on
> direct-io=off.

Yes, this makes sense.

> 
> >
> > It doesn't look super fast to me if we need to enable/disable dio in each
> > loop.. then it's a matter of whether we should bother, or whether it would
> > be easier to simply require multifd when direct-io=on.
> 
> AIUI, the issue here is that users are already allowed to specify in
> libvirt the equivalent of direct-io and multifd independently of each
> other (bypass-cache, parallel). To start requiring both together now in
> some situations would be a regression. I confess I don't know the libvirt
> code well enough to know whether this can be worked around somehow, but
> as I said, it's a relatively simple change from the QEMU side.

Firstly, I'd definitely like to avoid all the calls to either
migration_direct_io_start() or *_finish(); we already need to call them
explicitly in three paths, and that's neither intuitive nor readable, just
like the hard-coded rdma code.

I also worry we may be overlooking the complexity here; pinning buffers
definitely needs more thought on its own.  It's easier to digest when using
multifd, where QEMU only pins guest pages just like tcp-zerocopy does: those
are naturally host page size aligned, and also guaranteed to not be freed
(while being reused / modified is fine here, as dirty tracking guarantees a
new page will be migrated soon again).

IMHO here the "not be freed / modified" part is even more important than
"alignment": the latter is about perf, the former is about correctness.
When we do directio on random buffers, AFAIU we don't want the buffer to be
modified before it is flushed to disk, and that's IMHO not easy to
guarantee.

E.g., I don't think this guarantees a flush on the buffer usages:

  migration_direct_io_start()
/* flush any potentially unaligned IO before setting O_DIRECT */
qemu_fflush(file);

qemu_fflush() internally does writev(), and that "flush" is about "flushing
the qemufile iov[] to the fd", not "flushing buffers to disk".  I think it
means that if we do qemu_fflush() and then modify QEMUFile.buf[IO_BUF_SIZE],
we're doomed: we will never know whether the dio has happened, and which
version of the buffer will be sent; I don't think it's guaranteed it will
always be the old version of the buffer.

However the issue is, QEMUFile defines qemu_fflush() as: after the call, the
buf[] can be reused.

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-05 Thread Peter Xu
On Wed, Jun 05, 2024 at 08:48:28PM +, Dr. David Alan Gilbert wrote:
> > > I just noticed this thread; some random notes from a somewhat
> > > fragmented memory of this:
> > > 
> > >   a) Long long ago, I also tried rsocket; 
> > >   https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
> > >  as I remember the library was quite flaky at the time.
> > 
> > Hmm, interesting.  It also looks like there's a thread doing rpoll().
> 
> Yeh, I can't actually remember much more about what I did back then!

Heh, that's understandable and fair. :)

> > I hope Lei and his team have tested >4G mem, otherwise it's definitely
> > worth checking.  Lei also mentioned in the cover letter that they found
> > rsocket bugs, but I'm not sure what that's about.
> 
> It would probably be a good idea to keep track of what bugs
> are in flight with it, and try it on a few RDMA cards to see
> what problems get triggered.
> I think I reported a few at the time, but I gave up after
> feeling it was getting very hacky.

Agreed.  Maybe we can have a list of that in the cover letter or even
QEMU's migration/rdma doc page.

Lei, if you think that makes sense please do so in your upcoming posts.
There will need to be a list of the things you encountered in the kernel
driver, and it'll be even better if there are further links to read on each
problem.

> > > 
> > >   e) Someone made a good suggestion (sorry can't remember who) - that the
> > >  RDMA migration structure was the wrong way around - it should be the
> > >  destination which initiates an RDMA read, rather than the source
> > >  doing a write; then things might become a LOT simpler; you just need
> > >  to send page ranges to the destination and it can pull it.
> > >  That might work nicely for postcopy.
> > 
> > I'm not sure whether it'll still be a problem if rdma recv side is based on
> > zero-copy.  It would be a matter of whether atomicity can be guaranteed so
> > that we don't want the guest vcpus to see a partially copied page during
> > on-flight DMAs.  UFFDIO_COPY (or friend) is currently the only solution for
> > that.
> 
> Yes, but even ignoring that (and the UFFDIO_CONTINUE idea you mention), if
> the destination can issue an RDMA read itself, it doesn't need to send 
> messages
> to the source to ask for a page fetch; it just goes and grabs it itself,
> that's got to be good for latency.

Oh, that's pretty deep rdma internals to me and beyond my knowledge.. but
from what I can tell it sounds very reasonable indeed!

Thanks!

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-05 Thread Peter Xu
On Wed, Jun 05, 2024 at 10:10:57AM -0400, Peter Xu wrote:
> >   e) Someone made a good suggestion (sorry can't remember who) - that the
> >  RDMA migration structure was the wrong way around - it should be the
> >  destination which initiates an RDMA read, rather than the source
> >  doing a write; then things might become a LOT simpler; you just need
> >  to send page ranges to the destination and it can pull it.
> >  That might work nicely for postcopy.
> 
> I'm not sure whether it'll still be a problem if rdma recv side is based on
> zero-copy.  It would be a matter of whether atomicity can be guaranteed so
> that we don't want the guest vcpus to see a partially copied page during
> on-flight DMAs.  UFFDIO_COPY (or friend) is currently the only solution for
> that.

And when thinking about this (given UFFDIO_COPY's nature of not being able
to do zero-copy...), the only way this will be able to do zero-copy is to
use file-backed memory (shmem/hugetlbfs), as the page cache can be
prepopulated.  Then when we do DMA, the data goes through the page cache,
which can be mapped at another virtual address besides the one the vcpus
are using.

Then we can use UFFDIO_CONTINUE (rather than UFFDIO_COPY) to do atomic
updates on the vcpu pgtables, avoiding the copy.  QEMU doesn't have that
yet, but it looks like there's one more reason we may want to make better
use of shmem.. rather than anonymous memory.  And actually, when working on
4k faults on 1G hugetlb, I added CONTINUE support.

https://github.com/xzpeter/qemu/tree/doublemap
https://github.com/xzpeter/qemu/commit/b8aff3a9d7654b1cf2c089a06894ff4899740dc5
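
For reference, a minimal sketch of the resolve path described above,
assuming the vcpu-visible range was registered with
UFFDIO_REGISTER_MODE_MINOR and 'alias' is a second mapping of the same
shmem offset (needs a kernel with shmem minor-fault support; the receive is
shown as a memcpy, but in the zero-copy case the data would be DMA'd
straight into the alias mapping):

    #include <linux/userfaultfd.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /*
     * Fill the shmem page cache through the alias mapping, then let
     * UFFDIO_CONTINUE install the page table entry in the vcpu-visible
     * mapping atomically - no per-page bounce copy into the guest VMA.
     */
    static int resolve_minor_fault(int uffd, void *guest_addr, void *alias,
                                   const void *incoming, size_t pagesize)
    {
        struct uffdio_continue cont = {
            .range = {
                .start = (uintptr_t)guest_addr,
                .len = pagesize,
            },
            .mode = 0,
        };

        memcpy(alias, incoming, pagesize);   /* stand-in for the receive */
        return ioctl(uffd, UFFDIO_CONTINUE, &cont);
    }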

Maybe it's worthwhile on its own now, because it also means we can use that
in multifd to avoid one extra layer of buffering when supporting
multifd+postcopy (which has the same issue of directly copying data into
guest pages).  It'll also work with things like rdma in similar ways, I
think.  It's just that it won't work on anonymous memory.

I definitely hijacked the thread to somewhere too far away.  I'll stop
here..

Thanks,

-- 
Peter Xu




Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API

2024-06-05 Thread Peter Xu
On Wed, Jun 05, 2024 at 10:09:43AM +, Gonglei (Arei) wrote:
> Hi Peter,
> 
> > -Original Message-
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Wednesday, June 5, 2024 3:32 AM
> > To: Gonglei (Arei) 
> > Cc: qemu-devel@nongnu.org; yu.zh...@ionos.com; mgal...@akamai.com;
> > elmar.ger...@ionos.com; zhengchuan ;
> > berra...@redhat.com; arm...@redhat.com; lizhij...@fujitsu.com;
> > pbonz...@redhat.com; m...@redhat.com; Xiexiangyou
> > ; linux-r...@vger.kernel.org; lixiao (H)
> > ; jinpu.w...@ionos.com; Wangjialin
> > ; Fabiano Rosas 
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> > 
> > Hi, Lei, Jialin,
> > 
> > Thanks a lot for working on this!
> > 
> > I think we'll need to wait a bit for feedback from Jinpu and his team on
> > the RDMA side, and also from Daniel for iochannels.  Also, please remember
> > to copy Fabiano Rosas in any relevant future posts.  We'd also like to
> > know whether he has any comments too.  I have him copied in this reply.
> > 
> > On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > > From: Jialin Wang 
> > >
> > > Hi,
> > >
> > > This patch series attempts to refactor RDMA live migration by
> > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > >
> > > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > > the detail of rdma protocol into rsocket and allows us to add support
> > > for some modern features like multifd more easily.
> > >
> > > Here is the previous discussion on refactoring RDMA live migration
> > > using the rsocket API:
> > >
> > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > > o.org/
> > >
> > > We have encountered some bugs when using rsocket and plan to submit
> > > them to the rdma-core community.
> > >
> > > In addition, the use of rsocket makes our programming more convenient,
> > > but it must be noted that this method introduces multiple memory
> > > copies, so some performance degradation can be expected; we hope that
> > > friends with RDMA network cards can help verify, thank you!
> > 
> > It'll be good to elaborate if you tested it in-house. What should people
> > expect on the numbers, exactly?  Is that okay from Huawei's POV?
> > 
> > Besides that, the code looks pretty good at a first glance to me.  Before
> > others chime in, here are some high-level comments..
> > 
> > Firstly, can we avoid using a coroutine for listen()?  Might be relevant
> > when I see that rdma_accept_incoming_migration() runs in a loop to do
> > raccept(), but would that also hang the qemu main loop even with the
> > coroutine, before all channels are ready?  I'm not a coroutine person,
> > but I think the hope is that we can make dest QEMU run in a thread in the
> > future just like the src QEMU, so the less coroutine the better in this
> > path.
> > 
> 
> Because rsocket is set to non-blocking, raccept will return EAGAIN when no
> connection is received, the coroutine will yield, and it will not hang the
> qemu main loop.

Ah that's ok.  And also I just noticed it may not be a big deal either as
long as we're before migration_incoming_process().

I'm wondering whether it can be done similarly to what we do with sockets
in qio_net_listener_set_client_func_full().  After all, rsocket wants to
mimic the socket API, so it would make sense for the rsocket code to match
the socket code, or even reuse it.
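
In other words, the accept path could be shaped just like the plain-socket
listener callback, with the r* call swapped in (sketch only; raccept() is
the real rsocket API, the surrounding handler and how the new fd gets handed
off are assumptions):

    #include <errno.h>
    #include <stddef.h>
    #include <rdma/rsocket.h>

    /*
     * Called when the (non-blocking) listening rsocket is reported ready.
     * EAGAIN simply means "no pending connection yet", exactly like a
     * regular non-blocking accept(), so the watch/poll stays armed.
     */
    static int rdma_try_accept_one(int listen_fd)
    {
        int fd = raccept(listen_fd, NULL, NULL);

        if (fd < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                return 0;    /* not ready yet, try again on the next event */
            }
            return -errno;   /* real error: caller tears the listener down */
        }

        /* hand 'fd' over as a new migration channel (not shown) */
        return fd;
    }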

> 
> > I think I also left a comment elsewhere on whether it would be possible
> > to allow iochannels to implement their own poll() functions, to avoid the
> > per-channel poll thread that is proposed in this series.
> > 
> > https://lore.kernel.org/r/ZldY21xVExtiMddB@x1n
> > 
> 
> We noticed that, and it's a big operation. I'm not sure that's a better way.
> 
> > Personally I think even with the thread proposal it's better than the old
> > rdma code, but I just still want to double check with you guys.  E.g.,
> > maybe that just won't work at all?  Again, that'll also be based on the
> > fact that we move migration incoming into a thread first to keep the dest
> > QEMU main loop intact, I think, but I hope we will reach that regardless
> > of rdma, IOW it'll be nice to happen even earlier if possible.
> > 
> Yep. This is a fairly big change, I wonder what other people's suggestions
> are?

Yes, we can wait for others' opinions.  And btw, I'm not asking for it, and
I don't think it'll be a blocker for this approach to land; as I said, this
is better than the current code, so it's definitely an improvement to me.

I'm purely curious, because if you're not going to do it for rdma, maybe
someday I'll try to do it, and I want to know what the "big change" could
be, as I didn't dig further.  It would help me if you shared what issues
you've found.

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-05 Thread Peter Xu
Hey, Dave!

On Wed, Jun 05, 2024 at 12:31:56AM +, Dr. David Alan Gilbert wrote:
> * Michael Galaxy (mgal...@akamai.com) wrote:
> > One thing to keep in mind here (despite me not having any hardware to test)
> > was that one of the original goals here
> > in the RDMA implementation was not simply raw throughput nor raw latency,
> > but a lack of CPU utilization in kernel
> > space due to the offload. While it is entirely possible that newer hardware
> > w/ TCP might compete, the significant
> > reductions in CPU usage in the TCP/IP stack were a big win at the time.
> > 
> > Just something to consider while you're doing the testing
> 
> I just noticed this thread; some random notes from a somewhat
> fragmented memory of this:
> 
>   a) Long long ago, I also tried rsocket; 
>   https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
>  as I remember the library was quite flaky at the time.

Hmm, interesting.  It also looks like there's a thread doing rpoll().

Btw, not sure whether you noticed, but there's the series posted for the
latest rsocket conversion here:

https://lore.kernel.org/r/1717503252-51884-1-git-send-email-arei.gong...@huawei.com

I hope Lei and his team have tested >4G mem, otherwise it's definitely worth
checking.  Lei also mentioned in the cover letter that they found rsocket
bugs, but I'm not sure what that's about.

> 
>   b) A lot of the complexity in the rdma migration code comes from
> emulating a stream to carry the migration control data and interleaving
> that with the actual RAM copy.   I believe the original design used
> a separate TCP socket for the control data, and just used the RDMA
> for the data - that should be a lot simpler (but alas was rejected
> in review early on)
> 
>   c) I can't remember the last benchmarks I did; but I think I did
> manage to beat RDMA with multifd; but yes, multifd does eat host CPU
> whereas RDMA barely uses a whisper.

I think my first impression on this matter came from you on this one. :)

> 
>   d) The 'zero-copy-send' option in migrate may well get some of that
>  CPU time back; but if I remember we were still bottle necked on
>  the receive side. (I can't remember if zero-copy-send worked with
>  multifd?)

Yes, and zero-copy requires multifd for now.  I think it's because we didn't
want to complicate the header processing in the migration stream, which may
not be page aligned.

> 
>   e) Someone made a good suggestion (sorry can't remember who) - that the
>  RDMA migration structure was the wrong way around - it should be the
>  destination which initiates an RDMA read, rather than the source
>  doing a write; then things might become a LOT simpler; you just need
>  to send page ranges to the destination and it can pull it.
>  That might work nicely for postcopy.

I'm not sure whether it'll still be a problem if rdma recv side is based on
zero-copy.  It would be a matter of whether atomicity can be guaranteed so
that we don't want the guest vcpus to see a partially copied page during
on-flight DMAs.  UFFDIO_COPY (or friend) is currently the only solution for
that.
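
For reference, the atomic placement being referred to looks roughly like
this (minimal sketch): UFFDIO_COPY copies from a staging buffer and only
then installs the PTE, so vcpus never see a partially copied page.

    #include <linux/userfaultfd.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    /* Place one page into a userfaultfd-registered range atomically: the
     * destination mapping only becomes visible once the whole page has
     * been copied in. */
    static int place_page(int uffd, void *guest_addr, void *staging,
                          size_t pagesize)
    {
        struct uffdio_copy copy = {
            .dst = (uintptr_t)guest_addr,
            .src = (uintptr_t)staging,
            .len = pagesize,
            .mode = 0,
        };

        return ioctl(uffd, UFFDIO_COPY, &copy);
    }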

Thanks,

-- 
Peter Xu




Re: [PATCH 0/7] Live migration acceleration with UADK

2024-06-04 Thread Peter Xu
On Wed, May 29, 2024 at 10:44:20AM +0100, Shameer Kolothum via wrote:
> Hi,
> 
> This series adds support for UADK library based hardware acceleration
> for live migration. UADK[0] is a general-purpose user space accelerator
> framework that uses shared virtual addressing (SVA) to provide a unified
> programming interface for hardware acceleration of cryptographic and
> compression algorithms.
> 
> UADK makes use of the UACCE(Unified/User-space-access-intended Accelerator
> Framework) Linux kernel module which enables hardware accelerators from
> different vendors that support SVA to adapt to UADK. Linux kernel from
> v5.9 has support for UACCE and SVA on ARM64 platforms.
> 
> Currently, HiSilicon Kunpeng hardware accelerators have been registered with
> UACCE and the Zip accelerator on these platforms can be used for compression
> which can  free up CPU computing power and improve computing performance.
> 
> This series is on top of Intel IAA accelerator live migration support
> series[1] from Yuan Liu. Many thanks for doing this.

Just looked at the IAA series too; I didn't read multifd-*.[ch] much on
either side, but both series look pretty clean to me.

Thanks,

-- 
Peter Xu




Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration

2024-06-04 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:48PM -0300, Fabiano Rosas wrote:
> We've recently added support for direct-io with multifd, which brings
> performance benefits, but creates a non-uniform user interface by
> coupling direct-io with the multifd capability. This means that users
> cannot keep the direct-io flag enabled while disabling multifd.
> 
> Libvirt in particular already has support for direct-io and parallel
> migration separately from each other, so it would be a regression to
> now require both options together. It's relatively simple for QEMU to
> add support for direct-io migration without multifd, so let's do this
> in order to keep both options decoupled.
> 
> We cannot simply enable the O_DIRECT flag, however, because not all IO
> performed by the migration thread satisfies the alignment requirements
> of O_DIRECT. There are many small read & writes that add headers and
> synchronization flags to the stream, which at the moment are required
> to always be present.
> 
> Fortunately, due to fixed-ram migration there is a discernible moment
> where only RAM pages are written to the migration file. Enable
> direct-io during that moment.
> 
> Signed-off-by: Fabiano Rosas 

Is anyone going to consume this?  How's the performance?

It doesn't look super fast to me if we need to enable/disable dio in each
loop.. then it's a matter of whether we should bother, or whether it would
be easier to simply require multifd when direct-io=on.

Thanks,

-- 
Peter Xu




Re: [PATCH v2 15/18] tests/qtest/migration: Add a test for mapped-ram with passing of fds

2024-06-04 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:45PM -0300, Fabiano Rosas wrote:
> Add a multifd test for mapped-ram with passing of fds into QEMU. This
> is how libvirt will consume the feature.
> 
> There are a couple of details to the fdset mechanism:
> 
> - multifd needs two distinct file descriptors (not duplicated with
>   dup()) so it can enable O_DIRECT only on the channels that do
>   aligned IO. The dup() system call creates file descriptors that
>   share status flags, of which O_DIRECT is one.
> 
> - the open() access mode flags used for the fds passed into QEMU need
>   to match the flags QEMU uses to open the file. Currently O_WRONLY
>   for src and O_RDONLY for dst.
> 
> Note that fdset code goes under _WIN32 because fd passing is not
> supported on Windows.
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 14/18] migration: Add documentation for fdset with multifd + file

2024-06-04 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:44PM -0300, Fabiano Rosas wrote:
> With the last few changes to the fdset infrastructure, we now allow
> multifd to use an fdset when migrating to a file. This is useful for
> the scenario where the management layer wants to have control over the
> migration file.
> 
> By receiving the file descriptors directly, QEMU can delegate some
> high level operating system operations to the management layer (such
> as mandatory access control). The management layer might also want to
> add its own headers before the migration stream.
> 
> Document the "file:/dev/fdset/#" syntax for the multifd migration with
> mapped-ram. The requirements for the fdset mechanism are:
> 
> - the fdset must contain two fds that are not duplicates between
>   themselves;
> 
> - if direct-io is to be used, exactly one of the fds must have the
>   O_DIRECT flag set;
> 
> - the file must be opened with WRONLY on the migration source side;
> 
> - the file must be opened with RDONLY on the migration destination
>   side.
> 
> Signed-off-by: Fabiano Rosas 
> ---
>  docs/devel/migration/main.rst   | 24 +++-
>  docs/devel/migration/mapped-ram.rst |  6 +-
>  2 files changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
> index 495cdcb112..784c899dca 100644
> --- a/docs/devel/migration/main.rst
> +++ b/docs/devel/migration/main.rst
> @@ -47,11 +47,25 @@ over any transport.
>QEMU interference. Note that QEMU does not flush cached file
>data/metadata at the end of migration.
>  
> -In addition, support is included for migration using RDMA, which
> -transports the page data using ``RDMA``, where the hardware takes care of
> -transporting the pages, and the load on the CPU is much lower.  While the
> -internals of RDMA migration are a bit different, this isn't really visible
> -outside the RAM migration code.
> +  The file migration also supports using a file that has already been
> +  opened. A set of file descriptors is passed to QEMU via an "fdset"
> +  (see add-fd QMP command documentation). This method allows a
> +  management application to have control over the migration file
> +  opening operation. There are, however, strict requirements to this
> +  interface if the multifd capability is enabled:
> +
> +- the fdset must contain two file descriptors that are not
> +  duplicates between themselves;
> +- if the direct-io capability is to be used, exactly one of the
> +  file descriptors must have the O_DIRECT flag set;
> +- the file must be opened with WRONLY on the migration source side
> +  and RDONLY on the migration destination side.
> +
> +- rdma migration: support is included for migration using RDMA, which
> +  transports the page data using ``RDMA``, where the hardware takes
> +  care of transporting the pages, and the load on the CPU is much
> +  lower.  While the internals of RDMA migration are a bit different,
> +  this isn't really visible outside the RAM migration code.
>  
>  All these migration protocols use the same infrastructure to
>  save/restore state devices.  This infrastructure is shared with the
> diff --git a/docs/devel/migration/mapped-ram.rst b/docs/devel/migration/mapped-ram.rst
> index fa4cefd9fc..e6505511f0 100644
> --- a/docs/devel/migration/mapped-ram.rst
> +++ b/docs/devel/migration/mapped-ram.rst
> @@ -16,7 +16,7 @@ location in the file, rather than constantly being added to a
>  sequential stream. Having the pages at fixed offsets also allows the
>  usage of O_DIRECT for save/restore of the migration stream as the
>  pages are ensured to be written respecting O_DIRECT alignment
> -restrictions (direct-io support not yet implemented).
> +restrictions.
>  
>  Usage
>  -
> @@ -35,6 +35,10 @@ Use a ``file:`` URL for migration:
>  Mapped-ram migration is best done non-live, i.e. by stopping the VM on
>  the source side before migrating.
>  
> +For best performance enable the ``direct-io`` capability as well:

Parameter?

> +
> +``migrate_set_capability direct-io on``

migrate_set_parameters?

If with that fixed:

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v7 1/7] docs/migration: add qpl compression feature

2024-06-04 Thread Peter Xu
On Mon, Jun 03, 2024 at 11:41:00PM +0800, Yuan Liu wrote:
> add Intel Query Processing Library (QPL) compression method
> introduction
> 
> Signed-off-by: Yuan Liu 
> Reviewed-by: Nanhai Zou 

Acked-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API

2024-06-04 Thread Peter Xu
Hi, Lei, Jialin,

Thanks a lot for working on this!

I think we'll need to wait a bit for feedback from Jinpu and his team on the
RDMA side, and also from Daniel for iochannels.  Also, please remember to
copy Fabiano Rosas in any relevant future posts.  We'd also like to know
whether he has any comments too.  I have him copied in this reply.

On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> From: Jialin Wang 
> 
> Hi,
> 
> This patch series attempts to refactor RDMA live migration by
> introducing a new QIOChannelRDMA class based on the rsocket API.
> 
> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> that is a 1-1 match of the normal kernel 'sockets' API, which hides the
> detail of rdma protocol into rsocket and allows us to add support for
> some modern features like multifd more easily.
> 
> Here is the previous discussion on refactoring RDMA live migration using
> the rsocket API:
> 
> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-phi...@linaro.org/
> 
> We have encountered some bugs when using rsocket and plan to submit them to
> the rdma-core community.
> 
> In addition, the use of rsocket makes our programming more convenient,
> but it must be noted that this method introduces multiple memory copies,
> so some performance degradation can be expected; we hope that friends
> with RDMA network cards can help verify, thank you!

It'll be good to elaborate if you tested it in-house. What should people
expect on the numbers, exactly?  Is that okay from Huawei's POV?

Besides that, the code looks pretty good at a first glance to me.  Before
others chime in, here are some high-level comments..

Firstly, can we avoid using a coroutine for listen()?  Might be relevant
when I see that rdma_accept_incoming_migration() runs in a loop to do
raccept(), but would that also hang the qemu main loop even with the
coroutine, before all channels are ready?  I'm not a coroutine person, but
I think the hope is that we can make dest QEMU run in a thread in the
future just like the src QEMU, so the less coroutine the better in this
path.

I think I also left a comment elsewhere on whether it would be possible to
allow iochannels to implement their own poll() functions, to avoid the
per-channel poll thread that is proposed in this series.

https://lore.kernel.org/r/ZldY21xVExtiMddB@x1n

Personally I think even with the thread proposal it's better than the old
rdma code, but I just still want to double check with you guys.  E.g.,
maybe that just won't work at all?  Again, that'll also be based on the
fact that we move migration incoming into a thread first to keep the dest
QEMU main loop intact, I think, but I hope we will reach that regardless of
rdma, IOW it'll be nice to happen even earlier if possible.

Another nitpick is that qio_channel_rdma_listen_async() doesn't look used
and could probably be removed.

Thanks,

-- 
Peter Xu




Re: [PATCH V1 17/26] machine: memfd-alloc option

2024-06-04 Thread Peter Xu
On Tue, Jun 04, 2024 at 06:14:08PM +0200, David Hildenbrand wrote:
> On 04.06.24 17:58, Peter Xu wrote:
> > On Tue, Jun 04, 2024 at 08:13:26AM +0100, Daniel P. Berrangé wrote:
> > > On Mon, Jun 03, 2024 at 05:48:32PM -0400, Peter Xu wrote:
> > > > That property, regardless of what it is called (and I doubt whether
> > > > Dan's suggestion of "shared-ram" is good, e.g. mmap(MAP_SHARED)
> > > > doesn't have a user-visible fd but it's shared RAM for sure..), is
> > > > yet another way to specify guest memory types.
> > > > 
> > > > What if the user specified this property but specified something else in
> > > > the -object parameters?  E.g. -machine share-ram=on -object
> > > > memory-backend-ram,share=off.  What should we do?
> > > 
> > > The machine property would only apply to memory regions that are
> > > *NOT* being created via -object. The memory-backend objects would
> > > always honour their own share setting.
> > 
> > In that case we may want to rename that to share-ram-by-default=on.
> > Otherwise it's not clear which one would take effect from a user POV, even
> > if we can define it like that in the code.
> > 
> > Even with share-ram-by-default=on, it can still be confusing in some form
> > or another. Consider this cmdline:
> > 
> >-machine q35,share-ram-by-default=on -object memory-backend-ram,id=mem1
> > 
> > Then is mem1 shared or not?  From reading the cmdline, if we share RAM by
> > default it should be ON since we didn't specify it, but it's actually off,
> > because -object has its own default values.
> 
> We do have something similar with "merge" and "dump" properties. See
> machine_mem_merge() / machine_dump_guest_core().
> 
> These correspond to the "mem-merge" and "dump-guest-core" machine
> properties.

These look fine so far, as long as the -object cmdline doesn't allow
specifying the same thing again.

> 
> But ...
> 
> > 
> > IMHO fundamentally it's just controversial to have two ways to configure
> > guest memory.  If '-object' is the preferred and complete way to configure
> > it, I prefer sticking with it if possible and see what is missing.
> 
> ... I agree with that. With vhost-user we also require a reasonable
> configuration (using proper fd-based shared memory) for it to work.
> 
> > 
> > I think I raised that as the other major reason too, that I think it's so
> > far only about the vram that is out of the picture here.  We don't and
> > shouldn't have complicated RW RAMs floating around that we want this
> > property to cover.
> 
> Agreed. And maybe we can still keep migration of any MAP_PRIVATE thing
> working by migrating that memory? CPR will be "slightly less fast".
> 
> But the biggest piece -- guest RAM -- will be migrated via the fd directly.

I think it should work, but only without VFIO.  With VFIO there must be no
private pages at all, or migrating is racy with concurrent DMAs (yes,
AFAICT CPR can run migration with DMA running..).

CPR has a pretty tricky way of using VFIO pgtables, in that it requires the
PFNs to not change before/after migration.  Feel free to have a look at
VFIO_DMA_MAP_FLAG_VADDR in vfio.h to get a feeling for it.
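
For the curious, a hedged sketch of what that flag is about (see linux/vfio.h
for the authoritative definition; the helper below is made up): the IOVA
mappings and page pins are kept across the update and only the user virtual
address is re-registered, which is exactly why the underlying PFNs must not
change while DMA keeps running.

    #include <linux/vfio.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    /*
     * Hedged sketch: re-attach a new host virtual address to an existing,
     * still-pinned IOVA range after the update (the earlier step, detaching
     * the old vaddr with VFIO_DMA_UNMAP_FLAG_VADDR, is not shown).
     */
    static int vfio_update_vaddr(int container_fd, uint64_t iova,
                                 uint64_t size, void *new_hva)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_VADDR,
            .vaddr = (uintptr_t)new_hva,
            .iova = iova,
            .size = size,
        };

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }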

Thanks,

-- 
Peter Xu




Re: [PATCH V1 17/26] machine: memfd-alloc option

2024-06-04 Thread Peter Xu
On Tue, Jun 04, 2024 at 08:13:26AM +0100, Daniel P. Berrangé wrote:
> On Mon, Jun 03, 2024 at 05:48:32PM -0400, Peter Xu wrote:
> > That property, regardless of what it is called (and I doubt whether Dan's
> > suggestion of "shared-ram" is good, e.g. mmap(MAP_SHARED) doesn't have a
> > user-visible fd but it's shared RAM for sure..), is yet another way to
> > specify guest memory types.
> > 
> > What if the user specified this property but specified something else in
> > the -object parameters?  E.g. -machine share-ram=on -object
> > memory-backend-ram,share=off.  What should we do?
> 
> The machine property would only apply to memory regions that are
> *NOT* being created via -object. The memory-backend objects would
> always honour their own share setting.

In that case we may want to rename that to share-ram-by-default=on.
Otherwise it's not clear which one would take effect from a user POV, even
if we can define it like that in the code.

Even with share-ram-by-default=on, it can still be confusing in some form
or another. Consider this cmdline:

  -machine q35,share-ram-by-default=on -object memory-backend-ram,id=mem1

Then is mem1 shared or not?  From reading the cmdline, if we share RAM by
default it should be ON since we didn't specify it, but it's actually off,
because -object has its own default values.

IMHO fundamentally it's just controversial to have two ways to configure
guest memory.  If '-object' is the preferred and complete way to configure
it, I prefer sticking with it if possible and seeing what is missing.
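
For contrast, the explicit way that already exists today would be something
like (sketch of an existing cmdline form, not a new proposal):

    qemu-system-x86_64 -m 1G \
        -object memory-backend-memfd,id=mem0,size=1G,share=on \
        -machine q35,memory-backend=mem0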

I think I raised the other major reason too: so far it's only the vram that
is out of the picture here.  We don't and shouldn't have complicated RW RAM
regions floating around that we want this property to cover.  We should make
sure we introduce this property for a good reason, rather than "ok, we don't
know what happened, we don't know what will leverage this, but maybe there
is some floating RAM..".  Right after we introduce this property we need to
carry it forever, and be prepared for people to start using it with/without
cpr, and that's a maintenance cost that I want to see whether we can avoid.

Thanks,

-- 
Peter Xu




Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr

2024-06-03 Thread Peter Xu
On Fri, May 31, 2024 at 03:32:11PM -0400, Steven Sistare wrote:
> On 5/30/2024 2:39 PM, Peter Xu wrote:
> > On Thu, May 30, 2024 at 01:12:40PM -0400, Steven Sistare wrote:
> > > On 5/29/2024 3:25 PM, Peter Xu wrote:
> > > > On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:
> > > > > On 5/28/2024 5:44 PM, Peter Xu wrote:
> > > > > > On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
> > > > > > > Preserve fields of RAMBlocks that allocate their host memory
> > > > > > > during CPR so the RAM allocation can be recovered.
> > > > > > 
> > > > > > This sentence itself did not explain much, IMHO.  QEMU can share
> > > > > > memory using fd based memory already of all kinds; as long as the
> > > > > > memory backend is path-based it can be shared by sharing the same
> > > > > > paths to dst.
> > > > > > 
> > > > > > This reads as very confusing as a generic concept.  I mean, QEMU
> > > > > > migration relies on so many things to work right.  We mostly ask
> > > > > > the users to "use exactly the same cmdline for src/dst QEMU unless
> > > > > > you know what you're doing", otherwise many things can break.
> > > > > > That should also include the ramblocks being matched between
> > > > > > src/dst due to the same cmdlines provided on both sides.  It'll be
> > > > > > confusing to mention this when we thought the ramblocks also rely
> > > > > > on that fact.
> > > > > > 
> > > > > > So IIUC this sentence should be dropped in the real patch, and
> > > > > > I'll try to guess the real reason below..
> > > > > 
> > > > > The properties of the implicitly created ramblocks must be preserved.
> > > > > The defaults can and do change between qemu releases, even when the
> > > > > command-line parameters do not change for the explicit objects that
> > > > > cause these implicit ramblocks to be created.
> > > > 
> > > > AFAIU, QEMU relies on ramblocks being the same before this series.  Do
> > > > you have an example?  Would that already cause issues when migrating?
> > > 
> > > Alignment has changed, and used_length vs max_length changed when
> > > resizeable ramblocks were introduced.  I have dealt with these issues
> > > while supporting cpr for our internal use, and the learned lesson is to
> > > explicitly communicate the creation-time parameters to new qemu.
> > 
> > Why can used_length change?  I'm looking at ram_mig_ram_block_resized():
> > 
> >     if (!migration_is_idle()) {
> >         /*
> >          * Precopy code on the source cannot deal with the size of RAM
> >          * blocks changing at random points in time - especially after
> >          * sending the RAM block sizes in the migration stream, they must
> >          * no longer change.  Abort and indicate a proper reason.
> >          */
> >         error_setg(&err, "RAM block '%s' resized during precopy.",
> >                    rb->idstr);
> >         migration_cancel(err);
> >         error_free(err);
> >     }
> > 
> > We send used_length upfront during the migration SETUP phase.  Looks like
> > what you're describing can be something different, though?
> 
> I was imprecise.  used_length did not change; it was introduced as being
> different than max_length when resizeable ramblocks were introduced.
> 
> The max_length is not sent.  It is an implicit property of the implementation,
> and can change.  It is the size of the memfd mapping, so we need to know it
> and preserve it.
> 
> used_length is indeed sent during SETUP.  We could also send max_length
> at that time, and store both in the struct ramblock, and *maybe* that would
> be safe, but that is more fragile and less future proof than setting both
> properties to the correct value when the ramblock struct is created.
> 
> And BTW, the ramblock properties are sent using ad-hoc code in setup.
> I send them using nice clean vmstate.

Right, I agree that's not pretty at all... I wish we had had something
better.

Re: [PATCH V1 17/26] machine: memfd-alloc option

2024-06-03 Thread Peter Xu
On Fri, May 31, 2024 at 03:32:05PM -0400, Steven Sistare wrote:
> On 5/30/2024 2:14 PM, Peter Xu wrote:
> > On Thu, May 30, 2024 at 01:11:09PM -0400, Steven Sistare wrote:
> > > On 5/29/2024 3:14 PM, Peter Xu wrote:
> > > > On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:
> > > > > > > diff --git a/system/memory.c b/system/memory.c
> > > > > > > index 49f1cb2..ca04a0e 100644
> > > > > > > --- a/system/memory.c
> > > > > > > +++ b/system/memory.c
> > > > > > > @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
> > > > > > >   uint64_t size,
> > > > > > >   Error **errp)
> > > > > > > {
> > > > > > > +uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
> > > > > > 
> > > > > > If there's a machine option to "use memfd for allocations", then
> > > > > > it's shared mem... Hmm..
> > > > > > 
> > > > > > It is a bit confusing to me on quite a few levels:
> > > > > > 
> > > > > >  - Why would the memory allocation method be defined by a machine
> > > > > >property, even if we have memory-backend-* which should cover
> > > > > >everything?
> > > > > 
> > > > > Some memory regions are implicitly created, and have no explicit
> > > > > representation on the qemu command line.  memfd-alloc affects those.
> > > > > 
> > > > > More generally, memfd-alloc affects all ramblock allocations that are
> > > > > not explicitly represented by a memory-backend object.  Thus the simple
> > > > > command line "qemu -m 1G" does not explicitly describe an object, so it
> > > > > goes through the anonymous allocation path, and is affected by
> > > > > memfd-alloc.
> > > > 
> > > > Can we simply now allow "qemu -m 1G" to work for cpr-exec?
> > > 
> > > I assume you meant "simply not allow".
> > > 
> > > Yes, I could do that, but I would need to explicitly add code to exclude
> > > this case, and add a blocker.  Right now it "just works" for all paths
> > > that lead to ram_block_alloc_host, without any special logic at the
> > > memory-backend level.  And, I'm not convinced that simplifies the docs,
> > > as now I would need to tell the user that "-m 1G" and similar
> > > constructions do not work with cpr.
> > > 
> > > I can try to clarify the doc for -memfd-alloc as currently defined.
> > 
> > Why do we need to keep cpr working for existing qemu cmdlines?  We'll
> > already need to add new cmdline options anyway, right?
> > 
> > cpr-reboot wasn't doing it, and that made sense to me, so that new
> > features require the user to opt in, starting with changing their
> > cmdlines.
> 
> I agree.  We need a new option to opt in to cpr-friendly memory allocation,
> and I am proposing -machine memfd-alloc.  I am simply saying that I can try
> to do a better job explaining the functionality in my proposed text for
> memfd-alloc, instead of changing the functionality to exclude "-m 1G".  I
> believe excluding "-m 1G" is the wrong approach, for the reasons I stated -
> messier implementation *and* documentation.
> 
> I am open to different syntax for opting in.

If the machine property is the only way to go then I agree that might be a
good idea, even though the name can be further discussed.  But I still want
to figure out something below.

> 
> > > > AFAIU that's what we do with cpr-reboot: we ask the user to specify
> > > > the right things to make other things work.  Otherwise it won't.
> > > > 
> > > > > 
> > > > > Internally, create_default_memdev does create a memory-backend object.
> > > > > That is what my doc comment above refers to:
> > > > > Any associated memory-backend objects are created with share=on
> > > > > 
> > > > > An explicit "qemu -object mem

Re: [PATCH v2 4/4] tests/qtest/migration-test: Add a postcopy memfile test

2024-06-03 Thread Peter Xu
On Mon, Jun 03, 2024 at 04:02:42PM +1000, Nicholas Piggin wrote:
> On Fri May 31, 2024 at 11:34 PM AEST, Peter Xu wrote:
> > On Thu, May 30, 2024 at 07:54:07PM +1000, Nicholas Piggin wrote:
> > > Postcopy requires userfaultfd support, which requires tmpfs if a memory
> > > file is used.
> > > 
> > > This adds back support for /dev/shm memory files, but adds preallocation
> > > to skip environments where that mount is limited in size.
> > > 
> > > Signed-off-by: Nicholas Piggin 
> >
> > Thanks for doing this regardless.
> >
> > > ---
> > >  tests/qtest/migration-test.c | 77 
> > >  1 file changed, 69 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> > > index 86eace354e..5078033ded 100644
> > > --- a/tests/qtest/migration-test.c
> > > +++ b/tests/qtest/migration-test.c
> > > @@ -11,6 +11,7 @@
> > >   */
> > >  
> > >  #include "qemu/osdep.h"
> > > +#include "qemu/cutils.h"
> > >  
> > >  #include "libqtest.h"
> > >  #include "qapi/qmp/qdict.h"
> > > @@ -553,6 +554,7 @@ typedef struct {
> > >   */
> > >  bool hide_stderr;
> > >  bool use_memfile;
> > > +bool use_shm_memfile;
> >
> > Nitpick: when having both, it's slightly confusing on the name, e.g. not
> > clear whether use_memfile needs to be set to true too if 
> > use_shm_memfile=true.
> >
> > Maybe "use_tmpfs_memfile" and "use_shm_memfile"?
> 
> Could be easy to confuse. It's not actually "tmpfs", it is the fs that
> is mounted on /tmp :) tmpfs *is* shmfs in Linux. The intention was just
> that if you don't specify then it's because you don't have a particular
> requirement other than enough space.

Ah sorry, yeah I meant use_tmp_memfile..

> 
> > >  /* only launch the target process */
> > >  bool only_target;
> > >  /* Use dirty ring if true; dirty logging otherwise */
> > > @@ -739,7 +741,62 @@ static int test_migrate_start(QTestState **from, 
> > > QTestState **to,
> > >  ignore_stderr = "";
> > >  }
> > >  
> > > -if (args->use_memfile) {
> > > +if (!qtest_has_machine(machine_alias)) {
> > > +g_autofree char *msg = g_strdup_printf("machine %s not supported",
> > > +   machine_alias);
> > > +g_test_skip(msg);
> > > +return -1;
> > > +}
> > > +
> > > +if (args->use_shm_memfile) {
> > > +#if defined(__NR_userfaultfd) && defined(__linux__)
> >
> > IIUC we only need defined(__linux__) as there's nothing to do with uffd yet?
> 
> I thought it was polite since it uses a few other Linux (or at least
> POSIX) calls directly rather than go via the abstraction layer. Probably
> it would never happen that something defines __NR_userfaultfd that does
> not also have open and fallocate, but no harm?

It's about when there are shmem tests added without postcopy: we may want
the host to run those tests even if it doesn't support the userfault
syscall.

$ grep -iE "(SHMEM|USERFAULTFD)=" .config
CONFIG_SHMEM=y
CONFIG_USERFAULTFD=y

So I want to make sure the test always runs the right thing, regardless of
which arch it runs on, or the kernel config.

I agree that's not a huge deal, but I still wanted to remove the implied
correlation that userfault and shmem are closely related - they're just
totally independent to me, e.g., we can have a shmem test/hostconfig without
userfault, and we can also have a userfault test/hostconfig without shmem.

> 
> > > +int fd;
> > > +uint64_t size;
> > > +
> > > +if (getenv("GITLAB_CI")) {
> > > +/*
> > > + * Gitlab runners are limited to 64MB shm size and despite
> > > + * pre-allocation there is concern that concurrent tests
> > > + * could result in nondeterministic failures. Until all shm
> > > + * usage in all CI tests is found to fail gracefully on
> > > + * ENOSPC, it is safer to avoid large allocations for now.
> > > + *
> > > + * https://lore.kernel.org/qemu-devel/875xuwg4mx@suse.de/
> > > + */
> > > +g_test_skip("shm tests are not supported in Gitlab CI environment");

Re: [PATCH v2 07/18] monitor: Simplify fdset and fd removal

2024-05-31 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:37PM -0300, Fabiano Rosas wrote:
> Remove fds right away instead of setting the ->removed flag. We don't
> need the extra complexity of having a cleanup function reap the
> removed entries at a later time.
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 13/18] monitor: fdset: Match against O_DIRECT

2024-05-31 Thread Peter Xu
On Fri, May 31, 2024 at 12:42:05PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Thu, May 23, 2024 at 04:05:43PM -0300, Fabiano Rosas wrote:
> >> We're about to enable the use of O_DIRECT in the migration code and
> >> due to the alignment restrictions imposed by filesystems we need to
> >> make sure the flag is only used when doing aligned IO.
> >> 
> >> The migration will do parallel IO to different regions of a file, so
> >> we need to use more than one file descriptor. Those cannot be obtained
> >> by duplicating (dup()) since duplicated file descriptors share the
> >> file status flags, including O_DIRECT. If one migration channel does
> >> unaligned IO while another sets O_DIRECT to do aligned IO, the
> >> filesystem would fail the unaligned operation.
> >> 
> >> The add-fd QMP command along with the fdset code are specifically
> >> designed to allow the user to pass a set of file descriptors with
> >> different access flags into QEMU to be later fetched by code that
> >> needs to alternate between those flags when doing IO.
> >> 
> >> Extend the fdset matching to behave the same with the O_DIRECT flag.
> >> 
> >> Signed-off-by: Fabiano Rosas 
> >
> > Reviewed-by: Peter Xu 
> >
> > One thing I am confused about, though it's totally irrelevant to this
> > specific change.
> >
> > I wonder why we need dupfds at all, and why the client needs remove-fd at
> > all.
> 
> This answer was lost to time.
> 
> >
> > It's about what would go wrong if the qmp client only does add-fd; then,
> > if the fd is consumed by anything, it gets moved from the "fds" list to
> > the "dupfds" list.  The thing is, I don't expect the client to pass in any
> > fd that will not be consumed.  Then, if it's always consumed, why bother
> > with dup() at all, and why bother with an explicit remove-fd?
> 
> With the lack of documentation, I can only imagine the code was
> initially written to be more flexible, but we ended up never needing the
> extra flexibility. Maybe we could change that transparently in the
> future and deprecate qmp_remove_fd(). I'm thinking, even for the
> mapped-ram work, we might not need libvirt to call that function ever.

Good to know I'm not the only one thinking that.

It's okay - after your cleanup it's much better, at least to me.  The only
way to avoid these, AFAIU, is more careful design-level reviews in the
future, and more documentation stating the facts and keeping them in history
(and obviously why remove-fd was needed was not documented).  Now we carry
them.

-- 
Peter Xu




Re: [PATCH v2 06/18] monitor: Stop removing non-duplicated fds

2024-05-31 Thread Peter Xu
On Fri, May 31, 2024 at 12:25:52PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Thu, May 23, 2024 at 04:05:36PM -0300, Fabiano Rosas wrote:
> >> We've been up until now cleaning up any file descriptors that have
> >> been passed into QEMU and never duplicated[1,2]. A file descriptor
> >> without duplicates indicates that no part of QEMU has made use of
> >> it. This approach is starting to show some cracks now that we're
> >> starting to consume fds from the migration code:
> >> 
> >> - Doing cleanup every time the last monitor connection closes works to
> >>   reap unused fds, but also has the side effect of forcing the
> >>   management layer to pass the file descriptors again in case of a
> >>   disconnect/re-connect, if that happened to be the only monitor
> >>   connection.
> >> 
> >>   Another side effect is that removing an fd with qmp_remove_fd() is
> >>   effectively delayed until the last monitor connection closes.
> >> 
> >>   The reliance on mon_refcount is also problematic because it's racy.
> >> 
> >> - Checking runstate_is_running() skips the cleanup unless the VM is
> >>   running and avoids premature cleanup of the fds, but also has the
> >>   side effect of blocking the legitimate removal of an fd via
> >>   qmp_remove_fd() if the VM happens to be in another state.
> >> 
> >>   This affects qmp_remove_fd() and qmp_query_fdsets() in particular
> >>   because requesting a removal at a bad time (guest stopped) might
> >>   cause an fd to never be removed, or to be removed at a much later
> >>   point in time, causing the query command to continue showing the
> >>   supposedly removed fd/fdset.
> >> 
> >> Note that file descriptors that *have* been duplicated are owned by
> >> the code that uses them and will be removed after qemu_close() is
> >> called. Therefore we've decided that the best course of action to
> >> avoid the undesired side-effects is to stop managing non-duplicated
> >> file descriptors.
> >> 
> >> 1- efb87c1697 ("monitor: Clean up fd sets on monitor disconnect")
> >> 2- ebe52b592d ("monitor: Prevent removing fd from set during init")
> >> 
> >> Signed-off-by: Fabiano Rosas 
> >> ---
> >>  monitor/fds.c  | 15 ---
> >>  monitor/hmp.c  |  2 --
> >>  monitor/monitor-internal.h |  1 -
> >>  monitor/qmp.c  |  2 --
> >>  4 files changed, 8 insertions(+), 12 deletions(-)
> >> 
> >> diff --git a/monitor/fds.c b/monitor/fds.c
> >> index 101e21720d..f7b98814fa 100644
> >> --- a/monitor/fds.c
> >> +++ b/monitor/fds.c
> >> @@ -169,6 +169,11 @@ int monitor_get_fd(Monitor *mon, const char *fdname, 
> >> Error **errp)
> >>  
> >>  static void monitor_fdset_free(MonFdset *mon_fdset)
> >>  {
> >> +/*
> >> + * Only remove an empty fdset. The fds are owned by the user and
> >> + * should have been removed with qmp_remove_fd(). The dup_fds are
> >> + * owned by QEMU and should have been removed with qemu_close().
> >> + */
> >>  if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
> >>  QLIST_REMOVE(mon_fdset, next);
> >>  g_free(mon_fdset);
> >> @@ -189,9 +194,7 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
> >>  MonFdsetFd *mon_fdset_fd_next;
> >>  
> >> QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next, mon_fdset_fd_next) {
> >> -if ((mon_fdset_fd->removed ||
> >> -(QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
> >> -runstate_is_running()) {
> >> +if (mon_fdset_fd->removed) {
> >
> > I don't know the code well, but I like it.
> >
> >>  monitor_fdset_fd_free(mon_fdset_fd);
> >>  }
> >>  }
> >> @@ -206,7 +209,7 @@ void monitor_fdsets_cleanup(void)
> >>  
> >>  QEMU_LOCK_GUARD(&mon_fdsets_lock);
> >>  QLIST_FOREACH_SAFE(mon_fdset, &mon_fdsets, next, mon_fdset_next) {
> >> -monitor_fdset_cleanup(mon_fdset);
> >> +monitor_fdset_free(mon_fdset);
> >>  }
> >>  }
> >>  
> >> @@ -479,9 +482,7 @@ void monitor_fdset_dup_fd_remove(int dup_fd)
> >>  if (mon_fdset_fd_dup->

Re: [PATCH v2 4/4] tests/qtest/migration-test: Add a postcopy memfile test

2024-05-31 Thread Peter Xu
up_printf(
>  "-object memory-backend-file,id=mem0,size=%s"
> @@ -751,12 +808,6 @@ static int test_migrate_start(QTestState **from, 
> QTestState **to,
>  kvm_opts = ",dirty-ring-size=4096";
>  }
>  
> -if (!qtest_has_machine(machine_alias)) {
> -g_autofree char *msg = g_strdup_printf("machine %s not supported", 
> machine_alias);
> -g_test_skip(msg);
> -return -1;
> -}
> -
>  machine = resolve_machine_version(machine_alias, QEMU_ENV_SRC,
>QEMU_ENV_DST);
>  
> @@ -807,7 +858,7 @@ static int test_migrate_start(QTestState **from, 
> QTestState **to,
>   * Remove shmem file immediately to avoid memory leak in test failed 
> case.
>   * It's valid because QEMU has already opened this file
>   */
> -if (args->use_memfile) {
> +if (args->use_memfile || args->use_shm_memfile) {
>  unlink(memfile_path);
>  }
>  
> @@ -1275,6 +1326,15 @@ static void test_postcopy(void)
>  test_postcopy_common(&args);
>  }
>  
> +static void test_postcopy_memfile(void)
> +{

IMHO the defined(__NR_userfaultfd) should be here to guard if needed.

Or rather, maybe we don't need to care about uffd yet, the same as what we
already do with test_postcopy()?

I'm guessing the test just doesn't run on !linux, while compilation always
works with/without that.
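
E.g., just a sketch of what I mean (untested; assuming we also keep the
runtime has_uffd check in main() as it is today):

#if defined(__linux__) && defined(__NR_userfaultfd)
static void test_postcopy_memfile(void)
{
    MigrateCommon args = {
        .start.use_shm_memfile = true,
    };

    test_postcopy_common(&args);
}
#endif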

Thanks,

> +MigrateCommon args = {
> +.start.use_shm_memfile = true,
> +};
> +
> +test_postcopy_common(&args);
> +}
> +
>  static void test_postcopy_suspend(void)
>  {
>  MigrateCommon args = {
> @@ -3441,6 +3501,7 @@ int main(int argc, char **argv)
>  
>  if (has_uffd) {
>  migration_test_add("/migration/postcopy/plain", test_postcopy);
> +migration_test_add("/migration/postcopy/memfile", 
> test_postcopy_memfile);
>  migration_test_add("/migration/postcopy/recovery/plain",
> test_postcopy_recovery);
>  migration_test_add("/migration/postcopy/preempt/plain",
> -- 
> 2.43.0
> 

-- 
Peter Xu




Re: [PATCH v2 13/18] monitor: fdset: Match against O_DIRECT

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:43PM -0300, Fabiano Rosas wrote:
> We're about to enable the use of O_DIRECT in the migration code and
> due to the alignment restrictions imposed by filesystems we need to
> make sure the flag is only used when doing aligned IO.
> 
> The migration will do parallel IO to different regions of a file, so
> we need to use more than one file descriptor. Those cannot be obtained
> by duplicating (dup()) since duplicated file descriptors share the
> file status flags, including O_DIRECT. If one migration channel does
> unaligned IO while another sets O_DIRECT to do aligned IO, the
> filesystem would fail the unaligned operation.
> 
> The add-fd QMP command along with the fdset code are specifically
> designed to allow the user to pass a set of file descriptors with
> different access flags into QEMU to be later fetched by code that
> needs to alternate between those flags when doing IO.
> 
> Extend the fdset matching to behave the same with the O_DIRECT flag.
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

One thing I am confused but totally irrelevant to this specific change.

I wonder why do we need dupfds at all, and why client needs to remove-fd at
all.

It's about what would go wrong if qmp client only add-fd, then if it's
consumed by anything, it gets moved from "fds" list to "dupfds" list.  The
thing is I don't expect the client should pass over any fd that will not be
consumed.  Then if it's always consumed, why bother dup() at all, and why
bother an explicit remove-fd.
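
Side note, just restating the commit message with a standalone illustration:
duplicated descriptors share one open file description and therefore the
file status flags, so O_DIRECT set through one fd is also seen (and
enforced) through its dup(), which is why dup() alone can't give us a
second descriptor with different flags.  A minimal sketch, Linux-only and
nothing QEMU-specific:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/o_direct_demo", O_CREAT | O_WRONLY, 0600);
    int dupfd = dup(fd);

    /* Set O_DIRECT through the first descriptor... */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_DIRECT);

    /* ...and the duplicate sees it too, so unaligned IO on it would fail. */
    printf("dup'd fd has O_DIRECT: %d\n",
           !!(fcntl(dupfd, F_GETFL) & O_DIRECT));

    close(dupfd);
    close(fd);
    return 0;
}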

-- 
Peter Xu




Re: [PATCH v2 11/18] migration/multifd: Add direct-io support

2024-05-30 Thread Peter Xu
ramblock_iov(QIOChannel *ioc, const struct iovec *iov,
> diff --git a/migration/file.h b/migration/file.h
> index 7699c04677..9f71e87f74 100644
> --- a/migration/file.h
> +++ b/migration/file.h
> @@ -20,7 +20,6 @@ void file_start_outgoing_migration(MigrationState *s,
>  int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp);
>  void file_cleanup_outgoing_migration(void);
>  bool file_send_channel_create(gpointer opaque, Error **errp);
> -void file_create_incoming_channels(QIOChannel *ioc, Error **errp);
>  int file_write_ramblock_iov(QIOChannel *ioc, const struct iovec *iov,
>  int niov, RAMBlock *block, Error **errp);
>  int multifd_file_recv_data(MultiFDRecvParams *p, Error **errp);
> diff --git a/migration/migration.c b/migration/migration.c
> index e1b269624c..e03c80b3aa 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -155,6 +155,16 @@ static bool migration_needs_seekable_channel(void)
>  return migrate_mapped_ram();
>  }
>  
> +static bool migration_needs_extra_fds(void)
> +{
> +/*
> + * When doing direct-io, multifd requires two different,
> + * non-duplicated file descriptors so we can use one of them for
> + * unaligned IO.
> + */
> +return migrate_multifd() && migrate_direct_io();
> +}
> +
>  static bool transport_supports_seeking(MigrationAddress *addr)
>  {
>  if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
> @@ -164,6 +174,12 @@ static bool transport_supports_seeking(MigrationAddress 
> *addr)
>  return false;
>  }
>  
> +static bool transport_supports_extra_fds(MigrationAddress *addr)
> +{
> +/* file: works because QEMU can open it multiple times */
> +return addr->transport == MIGRATION_ADDRESS_TYPE_FILE;
> +}
> +
>  static bool
>  migration_channels_and_transport_compatible(MigrationAddress *addr,
>          Error **errp)
> @@ -180,6 +196,13 @@ 
> migration_channels_and_transport_compatible(MigrationAddress *addr,
>  return false;
>  }
>  
> +if (migration_needs_extra_fds() &&
> +!transport_supports_extra_fds(addr)) {
> +error_setg(errp,
> +   "Migration requires a transport that allows for extra fds 
> (e.g. file)");
> +return false;
> +}
> +
>  return true;
>  }
>  
> -- 
> 2.35.3
> 
> 

-- 
Peter Xu




Re: [PATCH v2 10/18] migration: Add direct-io parameter

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:40PM -0300, Fabiano Rosas wrote:
> Add the direct-io migration parameter that tells the migration code to
> use O_DIRECT when opening the migration stream file whenever possible.
> 
> This is currently only used with the mapped-ram migration that has a
> clear window guaranteed to perform aligned writes.
> 
> Acked-by: Markus Armbruster 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 09/18] io: Stop using qemu_open_old in channel-file

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:39PM -0300, Fabiano Rosas wrote:
> We want to make use of the Error object to report fdset errors from
> qemu_open_internal() and passing the error pointer to qemu_open_old()
> would require changing all callers. Move the file channel to the new
> API instead.
> 
> Reviewed-by: Daniel P. Berrangé 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 08/18] monitor: Report errors from monitor_fdset_dup_fd_add

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:38PM -0300, Fabiano Rosas wrote:
> I'm keeping the EACCES because callers expect to be able to look at
> errno.
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 06/18] monitor: Stop removing non-duplicated fds

2024-05-30 Thread Peter Xu
nitor_resume(mon);
>  }
>  qemu_mutex_unlock(&mon->mon_lock);
> -mon_refcount++;
>  break;
>  
>  case CHR_EVENT_CLOSED:
> -mon_refcount--;
>  monitor_fdsets_cleanup();
>  break;
>  
> diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h
> index 252de85681..cb628f681d 100644
> --- a/monitor/monitor-internal.h
> +++ b/monitor/monitor-internal.h
> @@ -168,7 +168,6 @@ extern bool qmp_dispatcher_co_shutdown;
>  extern QmpCommandList qmp_commands, qmp_cap_negotiation_commands;
>  extern QemuMutex monitor_lock;
>  extern MonitorList mon_list;
> -extern int mon_refcount;
>  
>  extern HMPCommand hmp_cmds[];
>  
> diff --git a/monitor/qmp.c b/monitor/qmp.c
> index a239945e8d..5e538f34c0 100644
> --- a/monitor/qmp.c
> +++ b/monitor/qmp.c
> @@ -466,7 +466,6 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent 
> event)
>  data = qmp_greeting(mon);
>  qmp_send_response(mon, data);
>  qobject_unref(data);
> -mon_refcount++;
>  break;
>  case CHR_EVENT_CLOSED:
>  /*
> @@ -479,7 +478,6 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent 
> event)
>  json_message_parser_destroy(>parser);
>  json_message_parser_init(>parser, handle_qmp_command,
>   mon, NULL);
> -mon_refcount--;
>  monitor_fdsets_cleanup();
>  break;
>  case CHR_EVENT_BREAK:

I like this too, now that mon_refcount can be dropped.  It turns out I like
this patch and the next one a lot, and I hope nothing will break.

Aside: you seem to have forgotten to remove the "int mon_refcount" in
monitor.c.

-- 
Peter Xu




Re: [PATCH v2 05/18] monitor: Introduce monitor_fdset_*free

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:35PM -0300, Fabiano Rosas wrote:
> Introduce two new functions to remove and free no longer used fds and
> fdsets.
> 
> We need those to decouple the remove/free routines from
> monitor_fdset_cleanup() which will go away in the next patches.
> 
> The new functions:
> 
> - monitor_fdset_free() will be used when a monitor connection closes
>   and when an fd is removed to cleanup any fdset that is now empty.
> 
> - monitor_fdset_fd_free() will be used to remove one or more fds that
>   have been explicitly targeted by qmp_remove_fd().
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

One nitpick below.

> ---
>  monitor/fds.c | 26 ++
>  1 file changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/monitor/fds.c b/monitor/fds.c
> index fb9f58c056..101e21720d 100644
> --- a/monitor/fds.c
> +++ b/monitor/fds.c
> @@ -167,6 +167,22 @@ int monitor_get_fd(Monitor *mon, const char *fdname, 
> Error **errp)
>  return -1;
>  }
>  
> +static void monitor_fdset_free(MonFdset *mon_fdset)
> +{
> +if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
> +QLIST_REMOVE(mon_fdset, next);
> +g_free(mon_fdset);
> +}
> +}

Would monitor_fdset_free_if_empty() (or similar) be slightly better?

static void monitor_fdset_free(MonFdset *mon_fdset)
{
QLIST_REMOVE(mon_fdset, next);
g_free(mon_fdset);
}

static void monitor_fdset_free_if_empty(MonFdset *mon_fdset)
{
if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
monitor_fdset_free(mon_fdset);
}
}
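
Then the cleanup path would read something like this (just a sketch on top
of your patch, untested):

static void monitor_fdset_cleanup(MonFdset *mon_fdset)
{
    MonFdsetFd *mon_fdset_fd;
    MonFdsetFd *mon_fdset_fd_next;

    QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next, mon_fdset_fd_next) {
        if ((mon_fdset_fd->removed ||
            (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
            runstate_is_running()) {
            monitor_fdset_fd_free(mon_fdset_fd);
        }
    }

    monitor_fdset_free_if_empty(mon_fdset);
}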

> +
> +static void monitor_fdset_fd_free(MonFdsetFd *mon_fdset_fd)
> +{
> +close(mon_fdset_fd->fd);
> +g_free(mon_fdset_fd->opaque);
> +QLIST_REMOVE(mon_fdset_fd, next);
> +g_free(mon_fdset_fd);
> +}
> +
>  static void monitor_fdset_cleanup(MonFdset *mon_fdset)
>  {
>  MonFdsetFd *mon_fdset_fd;
> @@ -176,17 +192,11 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
>  if ((mon_fdset_fd->removed ||
>  (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
>  runstate_is_running()) {
> -close(mon_fdset_fd->fd);
> -g_free(mon_fdset_fd->opaque);
> -QLIST_REMOVE(mon_fdset_fd, next);
> -g_free(mon_fdset_fd);
> +monitor_fdset_fd_free(mon_fdset_fd);
>  }
>  }
>  
> -if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
> -QLIST_REMOVE(mon_fdset, next);
> -g_free(mon_fdset);
> -}
> +monitor_fdset_free(mon_fdset);
>  }
>  
>  void monitor_fdsets_cleanup(void)
> -- 
> 2.35.3
> 

-- 
Peter Xu




Re: [PATCH V1 00/26] Live update: cpr-exec

2024-05-30 Thread Peter Xu
On Thu, May 30, 2024 at 01:17:05PM -0400, Steven Sistare wrote:
> On 5/28/2024 12:42 PM, Peter Xu wrote:
> > On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote:
> > > On 5/27/2024 1:45 PM, Peter Xu wrote:
> > > > On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
> > > > > I understand, thanks.  If I can help with any of your todo list,
> > > > > just ask - steve
> > > > 
> > > > Thanks for offering the help, Steve.  Started looking at this today, 
> > > > then I
> > > > found that I miss something high-level.  Let me ask here, and let me
> > > > apologize already for starting to throw multiple questions..
> > > > 
> > > > IIUC the whole idea of this patchset is to allow efficient QEMU 
> > > > upgrade, in
> > > > this case not host kernel but QEMU-only, and/or upper.
> > > > 
> > > > Is there any justification on why the complexity is needed here?  It 
> > > > looks
> > > > to me this one is more involved than cpr-reboot, so I'm thinking how 
> > > > much
> > > > we can get from the complexity, and whether it's worthwhile.  1000+ LOC 
> > > > is
> > > > the min support, and if we even expect more to come, that's really
> > > > important, IMHO.
> > > > 
> > > > For example, what's the major motivation of this whole work?  Is that 
> > > > more
> > > > on performance, or is it more for supporting the special devices like 
> > > > VFIO
> > > > which we used to not support, or something else?  I can't find them in
> > > > whatever cover letter I can find, including this one.
> > > > 
> > > > Firstly, regarding performance, IMHO it'll be always nice to share even
> > > > some very fundamental downtime measurement comparisons using the new 
> > > > exec
> > > > mode v.s. the old migration ways to upgrade QEMU binary.  Do you perhaps
> > > > have some number on hand when you started working on this feature years
> > > > ago?  Or maybe some old links on the list would help too, as I didn't
> > > > follow this work since the start.
> > > > 
> > > > On VFIO, IIUC you started out this project without VFIO migration being
> > > > there.  Now we have VFIO migration so not sure how much it would work 
> > > > for
> > > > the upgrade use case. Even with current VFIO migration, we may not want 
> > > > to
> > > > migrate device states for a local upgrade I suppose, as that can be a 
> > > > lot
> > > > depending on the type of device assigned.  However it'll be nice to 
> > > > discuss
> > > > this too if this is the major purpose of the series.
> > > > 
> > > > I think one other challenge on QEMU upgrade with VFIO devices is that 
> > > > the
> > > > dest QEMU won't be able to open the VFIO device when the src QEMU is 
> > > > still
> > > > using it as the owner.  IIUC this is a similar condition where QEMU 
> > > > wants
> > > > to have proper ownership transfer of a shared block device, and AFAIR 
> > > > right
> > > > now we resolved that issue using some form of file lock on the image 
> > > > file.
> > > > In this case it won't easily apply to a VFIO dev fd, but maybe we still
> > > > have other approaches, not sure whether you investigated any.  E.g. 
> > > > could
> > > > the VFIO handle be passed over using unix scm rights?  I think this 
> > > > might
> > > > remove one dependency of using exec which can cause quite some 
> > > > difference
> > > > v.s. a generic migration (from which regard, cpr-reboot is still a 
> > > > pretty
> > > > generic migration).
> > > > 
> > > > You also mentioned vhost/tap, is that also a major goal of this series 
> > > > in
> > > > the follow up patchsets?  Is this a problem only because this solution 
> > > > will
> > > > do exec?  Can it work if either the exec()ed qemu or dst qemu create the
> > > > vhost/tap fds when boot?
> > > > 
> > > > Meanwhile, could you elaborate a bit on the implication on chardevs?  
> > > > From
> > > > what I read in the doc update it looks like a major part of work in the
> > > > future, but I don't yet understand the issue..  Is it also relevant t

Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr

2024-05-30 Thread Peter Xu
On Thu, May 30, 2024 at 01:12:40PM -0400, Steven Sistare wrote:
> On 5/29/2024 3:25 PM, Peter Xu wrote:
> > On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:
> > > On 5/28/2024 5:44 PM, Peter Xu wrote:
> > > > On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
> > > > > Preserve fields of RAMBlocks that allocate their host memory during 
> > > > > CPR so
> > > > > the RAM allocation can be recovered.
> > > > 
> > > > This sentence itself did not explain much, IMHO.  QEMU can share memory
> > > > using fd based memory already of all kinds, as long as the memory 
> > > > backend
> > > > is path-based it can be shared by sharing the same paths to dst.
> > > > 
> > > > This reads very confusing as a generic concept.  I mean, QEMU migration
> > > > relies on so many things to work right.  We mostly asks the users to 
> > > > "use
> > > > exactly the same cmdline for src/dst QEMU unless you know what you're
> > > > doing", otherwise many things can break.  That should also include 
> > > > ramblock
> > > > being matched between src/dst due to the same cmdlines provided on both
> > > > sides.  It'll be confusing to mention this when we thought the ramblocks
> > > > also rely on that fact.
> > > > 
> > > > So IIUC this sentence should be dropped in the real patch, and I'll try 
> > > > to
> > > > guess the real reason with below..
> > > 
> > > The properties of the implicitly created ramblocks must be preserved.
> > > The defaults can and do change between qemu releases, even when the 
> > > command-line
> > > parameters do not change for the explicit objects that cause these 
> > > implicit
> > > ramblocks to be created.
> > 
> > AFAIU, QEMU relies on ramblocks being the same even before this series.  Do
> > you have an example?  Would that already cause issues when migrating?
> 
> Alignment has changed, and used_length vs max_length changed when
> resizeable ramblocks were introduced.  I have dealt with these issues
> while supporting cpr for our internal use, and the learned lesson is to
> explicitly communicate the creation-time parameters to new qemu.

Why would used_length change?  I'm looking at ram_mig_ram_block_resized():

if (!migration_is_idle()) {
/*
 * Precopy code on the source cannot deal with the size of RAM blocks
 * changing at random points in time - especially after sending the
 * RAM block sizes in the migration stream, they must no longer change.
 * Abort and indicate a proper reason.
 */
error_setg(&err, "RAM block '%s' resized during precopy.", rb->idstr);
migration_cancel(err);
error_free(err);
}

We send used_length upfront, during the migration SETUP phase.  Looks like
what you're describing could be something different, though?

Regarding rb->align: isn't that mostly a constant, reflecting the MR's
alignment?  It's set when the ramblock is created, IIUC:

rb->align = mr->align;

When will the alignment change?

> 
> These are not an issue for migration because the ramblock is re-created
> and the data copied into the new memory.
> 
> > > > > Mirror the mr->align field in the RAMBlock to simplify the vmstate.
> > > > > Preserve the old host address, even though it is immediately 
> > > > > discarded,
> > > > > as it will be needed in the future for CPR with iommufd.  Preserve
> > > > > guest_memfd, even though CPR does not yet support it, to maintain 
> > > > > vmstate
> > > > > compatibility when it becomes supported.
> > > > 
> > > > .. It could be about the vfio vaddr update feature that you mentioned 
> > > > and
> > > > only for iommufd (as IIUC vfio still relies on iova ranges, then it 
> > > > won't
> > > > help here)?
> > > > 
> > > > If so, IMHO we should have this patch (or any variance form) to be there
> > > > for your upcoming vfio support.  Keeping this around like this will make
> > > > the series harder to review.  Or is it needed even before VFIO?
> > > 
> > > This patch is needed independently of vfio or iommufd.
> > > 
> > > guest_memfd is independent of vfio or iommufd.  It is a recent addition
> > > which I have not tried to support, but I added this placeholder field
> > > to it can be supported in the future without adding a new field later
> > > and

Re: [PATCH V1 17/26] machine: memfd-alloc option

2024-05-30 Thread Peter Xu
On Thu, May 30, 2024 at 01:11:09PM -0400, Steven Sistare wrote:
> On 5/29/2024 3:14 PM, Peter Xu wrote:
> > On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:
> > > > > diff --git a/system/memory.c b/system/memory.c
> > > > > index 49f1cb2..ca04a0e 100644
> > > > > --- a/system/memory.c
> > > > > +++ b/system/memory.c
> > > > > @@ -1552,8 +1552,9 @@ bool 
> > > > > memory_region_init_ram_nomigrate(MemoryRegion *mr,
> > > > >  uint64_t size,
> > > > >  Error **errp)
> > > > >{
> > > > > +uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
> > > > 
> > > > If there's a machine option to "use memfd for allocations", then it's
> > > > shared mem... Hmm..
> > > > 
> > > > It is a bit confusing to me in quite a few levels:
> > > > 
> > > > - Why memory allocation method will be defined by a machine 
> > > > property,
> > > >   even if we have memory-backend-* which should cover everything?
> > > 
> > > Some memory regions are implicitly created, and have no explicit 
> > > representation
> > > on the qemu command line.  memfd-alloc affects those.
> > > 
> > > More generally, memfd-alloc affects all ramblock allocations that are
> > > not explicitly represented by memory-backend object.  Thus the simple
> > > command line "qemu -m 1G" does not explicitly describe an object, so it
> > > goes through the anonymous allocation path, and is affected by 
> > > memfd-alloc.
> > 
> > Can we simply now allow "qemu -m 1G" to work for cpr-exec?
> 
> I assume you meant "simply not allow".
> 
> Yes, I could do that, but I would need to explicitly add code to exclude this
> case, and add a blocker.  Right now it "just works" for all paths that lead to
> ram_block_alloc_host, without any special logic at the memory-backend level.
> And, I'm not convinced that simplifies the docs, as now I would need to tell
> the user that "-m 1G" and similar constructions do not work with cpr.
> 
> I can try to clarify the doc for -memfd-alloc as currently defined.

Why do we need to keep cpr working for existing qemu cmdlines?  We'll
need to add more new cmdline options anyway, right?

cpr-reboot didn't do that, and that made sense to me: new features
should require the user to opt in, starting with changing the
cmdline.

> 
> > AFAIU that's
> > what we do with cpr-reboot: we ask the user to specify the right things to
> > make other thing work.  Otherwise it won't.
> > 
> > > 
> > > Internally, create_default_memdev does create a memory-backend object.
> > > That is what my doc comment above refers to:
> > >Any associated memory-backend objects are created with share=on
> > > 
> > > An explicit "qemu -object memory-backend-*" is not affected by 
> > > memfd-alloc.
> > > 
> > > The qapi comments in patch "migration: cpr-exec mode" attempt to say all 
> > > that:
> > > 
> > > +# Memory backend objects must have the share=on attribute, and
> > > +# must be mmap'able in the new QEMU process.  For example,
> > > +# memory-backend-file is acceptable, but memory-backend-ram is
> > > +# not.
> > > +#
> > > +# The VM must be started with the '-machine memfd-alloc=on'
> > > +# option.  This causes implicit ram blocks -- those not explicitly
> > > +# described by a memory-backend object -- to be allocated by
> > > +# mmap'ing a memfd.  Examples include VGA, ROM, and even guest
> > > +# RAM when it is specified without a memory-backend object.
> > 
> > VGA is IIRC 16MB chunk, ROM is even smaller.  If the user specifies -object
> > memory-backend-file,share=on propertly, these should be the only outliers?
> > 
> > Are these important enough for the downtime?  Can we put them into the
> > migrated image alongside with the rest device states?
> 
> It's not about downtime.  vfio, vdpa, and iommufd pin all guest pages.
> The pages must remain pinned during CPR to support ongoing DMA activity
> which could target those pages (which we do not quiesce), and the same
> physical pages must be used for the ramblocks in the new qemu process.

Ah ok, yes DMA can happen on the fly.

Guest mem is definitely the major DMA target and that can b

Re: [PATCH V1 07/26] migration: VMStateId

2024-05-30 Thread Peter Xu
On Thu, May 30, 2024 at 01:11:26PM -0400, Steven Sistare wrote:
> On 5/29/2024 2:53 PM, Peter Xu wrote:
> > On Wed, May 29, 2024 at 01:30:18PM -0400, Steven Sistare wrote:
> > > How about a more general name for the type:
> > > 
> > > migration/misc.h
> > >  typedef char (MigrationId)[256];
> > 
> > How about qemu/typedefs.h?  Not sure whether it's applicable. Markus (in
> > the loop) may have a better idea.
> > 
> > Meanwhile, s/MigrationID/IDString/?
> 
> typedefs.h has a different purpose; giving short names to types
> defined in internal include files.
> 
> This id is specific to migration, so I still think its name should reflect
> migration and it belongs in some include/migration/*.h file.
> 
> ramblocks and migration are already closely related.  There is nothing wrong
> with including a migration header in ramblock.h so it can use a migration 
> type.
> We already have:
>   include/hw/acpi/ich9_tco.h:#include "migration/vmstate.h"
>   include/hw/display/ramfb.h:#include "migration/vmstate.h"
>   include/hw/input/pl050.h:#include "migration/vmstate.h"
>   include/hw/pci/shpc.h:#include "migration/vmstate.h"
>   include/hw/virtio/virtio.h:#include "migration/vmstate.h"
>   include/hw/hyperv/vmbus.h:#include "migration/vmstate.h"
> 
> The 256 byte magic length already appears in too many places, and my code
> would add more occurrences, so I really think that abstracting this type
> would be cleaner.

I agree having a typedef is nicer, but I don't understand why it must be
migration related.  It can be the type QEMU uses to represent any string
based ID, and that's a generic concept to me.

Migration can have a wrapper to process that type, and then migration would
simply include the generic header; that feels more natural to me?
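
E.g., a rough sketch of what I have in mind (names and header placement are
only for illustration):

/* include/qemu/typedefs.h, or some other generic header */
typedef char IDString[256];

/* migration can still add its own alias on top if that reads better */
typedef IDString MigrationId;

Then ramblock.h and savevm.c can use the generic type (or the alias) for
their idstr fields without pulling in anything migration specific.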

-- 
Peter Xu




Re: [PATCH v2 03/18] tests/qtest/migration: Add a precopy file test with fdset

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:33PM -0300, Fabiano Rosas wrote:
> Add a test for file migration using fdset. The passing of fds is more
> complex than using a file path. This is also the scenario where it's
> most important we ensure that the initial migration stream offset is
> respected because the fdset interface is the one used by the
> management layer when providing a non empty migration file.
> 
> Note that fd passing is not available on Windows, so anything that
> uses add-fd needs to exclude that platform.
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 02/18] tests/qtest/migration: Fix file migration offset check

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:32PM -0300, Fabiano Rosas wrote:
> When doing file migration, QEMU accepts an offset that should be
> skipped when writing the migration stream to the file. The purpose of
> the offset is to allow the management layer to put its own metadata at
> the start of the file.
> 
> We have tests for this in migration-test, but only testing that the
> migration stream starts at the correct offset and not that it actually
> leaves the data intact. Unsurprisingly, there's been a bug in that
> area that the tests didn't catch.
> 
> Fix the tests to write some data to the offset region and check that
> it's actually there after the migration.
> 
> While here, switch to using g_get_file_contents() which is more
> portable than mmap().
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 01/18] migration: Fix file migration with fdset

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:31PM -0300, Fabiano Rosas wrote:
> When the "file:" migration support was added we missed the special
> case in the qemu_open_old implementation that allows for a particular
> file name format to be used to refer to a set of file descriptors that
> have been previously provided to QEMU via the add-fd QMP command.
> 
> When using this fdset feature, we should not truncate the migration
> file because being given an fd means that the management layer is in
> control of the file and will likely already have some data written to
> it. This is further indicated by the presence of the 'offset'
> argument, which indicates the start of the region where QEMU is
> allowed to write.
> 
> Fix the issue by replacing the O_TRUNC flag on open by an ftruncate
> call, which will take the offset into consideration.
> 
> Fixes: 385f510df5 ("migration: file URI offset")
> Suggested-by: Daniel P. Berrangé 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

Right below the change, did we forget to free the ioc if
qio_channel_io_seek() failed?

> ---
>  migration/file.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/file.c b/migration/file.c
> index ab18ba505a..ba5b5c44ff 100644
> --- a/migration/file.c
> +++ b/migration/file.c
> @@ -84,12 +84,19 @@ void file_start_outgoing_migration(MigrationState *s,
>  
>  trace_migration_file_outgoing(filename);
>  
> -fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
> - 0600, errp);
> +fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY, 0600, 
> errp);
>  if (!fioc) {
>  return;
>  }
>  
> +if (ftruncate(fioc->fd, offset)) {
> +error_setg_errno(errp, errno,
> + "failed to truncate migration file to offset %" 
> PRIx64,
> + offset);
> +object_unref(OBJECT(fioc));
> +return;
> +}
> +
>  outgoing_args.fname = g_strdup(filename);
>  
>  ioc = QIO_CHANNEL(fioc);
> -- 
> 2.35.3
> 

-- 
Peter Xu




Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr

2024-05-29 Thread Peter Xu
On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:
> On 5/28/2024 5:44 PM, Peter Xu wrote:
> > On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
> > > Preserve fields of RAMBlocks that allocate their host memory during CPR so
> > > the RAM allocation can be recovered.
> > 
> > This sentence itself did not explain much, IMHO.  QEMU can share memory
> > using fd based memory already of all kinds, as long as the memory backend
> > is path-based it can be shared by sharing the same paths to dst.
> > 
> > This reads very confusing as a generic concept.  I mean, QEMU migration
> > relies on so many things to work right.  We mostly asks the users to "use
> > exactly the same cmdline for src/dst QEMU unless you know what you're
> > doing", otherwise many things can break.  That should also include ramblock
> > being matched between src/dst due to the same cmdlines provided on both
> > sides.  It'll be confusing to mention this when we thought the ramblocks
> > also rely on that fact.
> > 
> > So IIUC this sentence should be dropped in the real patch, and I'll try to
> > guess the real reason with below..
> 
> The properties of the implicitly created ramblocks must be preserved.
> The defaults can and do change between qemu releases, even when the 
> command-line
> parameters do not change for the explicit objects that cause these implicit
> ramblocks to be created.

AFAIU, QEMU relies on ramblocks being the same even before this series.  Do
you have an example?  Would that already cause issues when migrating?

> 
> > > Mirror the mr->align field in the RAMBlock to simplify the vmstate.
> > > Preserve the old host address, even though it is immediately discarded,
> > > as it will be needed in the future for CPR with iommufd.  Preserve
> > > guest_memfd, even though CPR does not yet support it, to maintain vmstate
> > > compatibility when it becomes supported.
> > 
> > .. It could be about the vfio vaddr update feature that you mentioned and
> > only for iommufd (as IIUC vfio still relies on iova ranges, then it won't
> > help here)?
> > 
> > If so, IMHO we should have this patch (or any variance form) to be there
> > for your upcoming vfio support.  Keeping this around like this will make
> > the series harder to review.  Or is it needed even before VFIO?
> 
> This patch is needed independently of vfio or iommufd.
> 
> guest_memfd is independent of vfio or iommufd.  It is a recent addition
> which I have not tried to support, but I added this placeholder field
> to it can be supported in the future without adding a new field later
> and maintaining backwards compatibility.

Is guest_memfd the only user so far, then?  If so, would it be possible to
split it out as a separate effort on top of the base cpr-exec support?

-- 
Peter Xu




Re: [PATCH V1 17/26] machine: memfd-alloc option

2024-05-29 Thread Peter Xu
On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:
> > > diff --git a/system/memory.c b/system/memory.c
> > > index 49f1cb2..ca04a0e 100644
> > > --- a/system/memory.c
> > > +++ b/system/memory.c
> > > @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion 
> > > *mr,
> > > uint64_t size,
> > > Error **errp)
> > >   {
> > > +uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
> > 
> > If there's a machine option to "use memfd for allocations", then it's
> > shared mem... Hmm..
> > 
> > It is a bit confusing to me in quite a few levels:
> > 
> >- Why memory allocation method will be defined by a machine property,
> >  even if we have memory-backend-* which should cover everything?
> 
> Some memory regions are implicitly created, and have no explicit 
> representation
> on the qemu command line.  memfd-alloc affects those.
> 
> More generally, memfd-alloc affects all ramblock allocations that are
> not explicitly represented by memory-backend object.  Thus the simple
> command line "qemu -m 1G" does not explicitly describe an object, so it
> goes through the anonymous allocation path, and is affected by memfd-alloc.

Can we simply now allow "qemu -m 1G" to work for cpr-exec?  AFAIU that's
what we do with cpr-reboot: we ask the user to specify the right things to
make other thing work.  Otherwise it won't.

> 
> Internally, create_default_memdev does create a memory-backend object.
> That is what my doc comment above refers to:
>   Any associated memory-backend objects are created with share=on
> 
> An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc.
> 
> The qapi comments in patch "migration: cpr-exec mode" attempt to say all that:
> 
> +# Memory backend objects must have the share=on attribute, and
> +# must be mmap'able in the new QEMU process.  For example,
> +# memory-backend-file is acceptable, but memory-backend-ram is
> +# not.
> +#
> +# The VM must be started with the '-machine memfd-alloc=on'
> +# option.  This causes implicit ram blocks -- those not explicitly
> +# described by a memory-backend object -- to be allocated by
> +# mmap'ing a memfd.  Examples include VGA, ROM, and even guest
> +# RAM when it is specified without a memory-backend object.

VGA is IIRC a 16MB chunk, and ROM is even smaller.  If the user specifies
-object memory-backend-file,share=on properly, these should be the only
outliers?

Are these important enough for the downtime?  Can we put them into the
migrated image alongside with the rest device states?

> 
> >- Even if we have such a machine property, why setting "memfd" will
> >  always imply shared?  why not private?  After all it's not called
> >  "memfd-shared-alloc", and we can create private mappings using
> >  e.g. memory-backend-memfd,share=off.
> 
> There is no use case for memfd-alloc with share=off, so no point IMO in
> making the option more verbose.

Unfortunately this fact doesn't make the property easier to understand. :-(

> For cpr, the mapping with all its modifications must be visible to new
> qemu when qemu mmaps it.

So this might be the important part - do you mean migrating
VGA/ROM/... small ramblocks won't work (besides any performance concerns)?
Could you elaborate?

Cpr-reboot already introduced lots of tricky knobs to QEMU.  We may need to
keep that specialty to a minimum, making the interface as clear as
possible, or (at least migration) maintainers will soon get scared and run
away, if such a proposal isn't shot down first.

In short, when we introduce new knobs for cpr, I hope we don't design them
only with cpr-* modes in mind, but also consider whether the user can use
them without cpr-*.  I'm not sure whether that will always be possible, but
we should try.

Thanks,

-- 
Peter Xu




Re: [PATCH V1 07/26] migration: VMStateId

2024-05-29 Thread Peter Xu
On Wed, May 29, 2024 at 01:30:18PM -0400, Steven Sistare wrote:
> How about a more general name for the type:
> 
> migration/misc.h
> typedef char (MigrationId)[256];

How about qemu/typedefs.h?  Not sure whether it's applicable. Markus (in
the loop) may have a better idea.

Meanwhile, s/MigrationID/IDString/?

> 
> exec/ramblock.h
> struct RAMBlock {
> MigrationId idstr;
> 
> migration/savevm.c
> typedef struct CompatEntry {
> MigrationId idstr;
> 
> typedef struct SaveStateEntry {
>     MigrationId idstr;
> 
> 
> - Steve
> 

-- 
Peter Xu




Re: [PATCH V1 05/26] migration: precreate vmstate

2024-05-29 Thread Peter Xu
On Tue, May 28, 2024 at 11:09:49AM -0400, Steven Sistare wrote:
> On 5/27/2024 2:16 PM, Peter Xu wrote:
> > On Mon, Apr 29, 2024 at 08:55:14AM -0700, Steve Sistare wrote:
> > > Provide the VMStateDescription precreate field to mark objects that must
> > > be loaded on the incoming side before devices have been created, because
> > > they provide properties that will be needed at creation time.  They will
> > > be saved to and loaded from their own QEMUFile, via
> > > qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these
> > > functions are not yet called in this patch.  Allow them to be called
> > > before or after normal migration is active, when current_migration and
> > > current_incoming are not valid.
> > > 
> > > Signed-off-by: Steve Sistare 
> > > ---
> > >   include/migration/vmstate.h |  6 
> > >   migration/savevm.c  | 69 
> > > +
> > >   migration/savevm.h  |  3 ++
> > >   3 files changed, 73 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> > > index 294d2d8..4691334 100644
> > > --- a/include/migration/vmstate.h
> > > +++ b/include/migration/vmstate.h
> > > @@ -198,6 +198,12 @@ struct VMStateDescription {
> > >* a QEMU_VM_SECTION_START section.
> > >*/
> > >   bool early_setup;
> > > +
> > > +/*
> > > + * Send/receive this object in the precreate migration stream.
> > > + */
> > > +bool precreate;
> > > +
> > >   int version_id;
> > >   int minimum_version_id;
> > >   MigrationPriority priority;
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index 9789823..a30bcd9 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -239,6 +239,7 @@ static SaveState savevm_state = {
> > >   #define SAVEVM_FOREACH(se, entry)\
> > >   QTAILQ_FOREACH(se, &savevm_state.handlers, entry)\
> > > +if (!se->vmsd || !se->vmsd->precreate)
> > >   #define SAVEVM_FOREACH_ALL(se, entry)\
> > >   QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
> > > @@ -1006,13 +1007,19 @@ static void save_section_header(QEMUFile *f, 
> > > SaveStateEntry *se,
> > >   }
> > >   }
> > > +static bool send_section_footer(SaveStateEntry *se)
> > > +{
> > > +return (se->vmsd && se->vmsd->precreate) ||
> > > +   migrate_get_current()->send_section_footer;
> > > +}
> > 
> > Does the precreate vmsd "require" the footer?  Or it should also work?
> > IMHO it's less optimal to bind features without good reasons.
> 
> It is not required.  However, IMO we should not treat send-section-footer as
> a fungible feature.  It is strictly an improvement, as was added to catch
> misformated sections.  It is only registered as a feature for backwards
> compatibility with qemu 2.3 and xen.
> 
> For a brand new data stream such as precreate, where we are not constrained
> by backwards compatibility, we should unconditionally use the better protocol,
> and always send the footer.

I see your point, but I still don't think we should tie these things
together.  It makes it harder to justify in the future whether the section
footer should be sent.

Take any new migration feature like mapped-ram as an example: we might also
want to enforce it by adding "return migrate_mapped_ram() || ...", but that
means we keep growing this with no benefit.

The worry about "what if this is turned off" isn't a real one: nobody will
turn it off!  We have started to deprecate machines, and I have a feeling
it will be enforced by default at some point.

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Peter Xu
Lei,

On Wed, May 29, 2024 at 02:43:46AM +, Gonglei (Arei) wrote:
> For rdma programming, the current mainstream implementation is to use
> rdma_cm to establish a connection, and then use verbs to transmit data.
> rdma_cm and ibverbs create two FDs respectively. The two FDs have
> different responsibilities. rdma_cm fd is used to notify connection
> establishment events, and verbs fd is used to notify new CQEs. When
> poll/epoll monitoring is directly performed on the rdma_cm fd, only a
> pollin event can be monitored, which means that an rdma_cm event
> occurs. When the verbs fd is directly polled/epolled, only the pollin
> event can be listened, which indicates that a new CQE is generated.
>
> Rsocket is a sub-module attached to the rdma_cm library and provides
> rdma calls that are completely similar to socket interfaces. However,
> this library returns only the rdma_cm fd for listening to link
> setup-related events and does not expose the verbs fd (readable and
> writable events for listening to data). Only the rpoll interface provided
> by the RSocket can be used to listen to related events. However, QEMU
> uses the ppoll interface to listen to the rdma_cm fd (gotten by raccept
> API).  And cannot listen to the verbs fd event. Only some hacking methods
> can be used to address this problem.  Do you guys have any ideas? Thanks.

I saw that you mentioned this elsewhere:

> Right. But the question is QEMU does not use rpoll but glib's ppoll. :(

So what I'm thinking may not make much sense; as I mentioned, I don't think
I know rdma at all.. and my idea also involves coroutine stuff which I
don't know well either. But just in case it sheds some light in some form.

IIUC we handle iochannel blocking with this for both read and write:

if (len == QIO_CHANNEL_ERR_BLOCK) {
if (qemu_in_coroutine()) {
qio_channel_yield(ioc, G_IO_XXX);
} else {
qio_channel_wait(ioc, G_IO_XXX);
}
continue;
}

One thing I'm wondering is whether we can provide a new feature bit for
qiochannel, e.g., QIO_CHANNEL_FEATURE_POLL, so that the iochannel can
define its own poll routine rather than using the default when possible.

I think it may not work if it's in a coroutine, as I guess that'll block
other fds from being woken up.  Hence it should look like this:

if (len == QIO_CHANNEL_ERR_BLOCK) {
if (qemu_in_coroutine()) {
qio_channel_yield(ioc, G_IO_XXX);
} else if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_POLL)) {
qio_channel_poll(ioc, G_IO_XXX);
} else {
qio_channel_wait(ioc, G_IO_XXX);
}
continue;
}

Maybe we even want to forbid such a channel from being used in a coroutine
at all, as having QIO_CHANNEL_FEATURE_POLL set may mean that this iochannel
simply won't work with poll(), like in rdma's use case.

Then rdma iochannel can implement qio_channel_poll() using rpoll().
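
Something like this on the rdma side is what I mean - a totally untested
sketch, where QIOChannelRDMA, its fd field and the qio_channel_poll() hook
are all made up for illustration; the only real API assumed is rsocket's
rpoll(), which takes the usual poll()-style arguments:

#include <errno.h>
#include <poll.h>
#include <rdma/rsocket.h>   /* rpoll() */
#include "io/channel.h"
#include "qapi/error.h"

/* Hypothetical rsocket-based channel, only to make the sketch complete */
typedef struct QIOChannelRDMA {
    QIOChannel parent;
    int fd;                 /* fd returned by rsocket()/raccept() */
} QIOChannelRDMA;

static int qio_channel_rdma_poll(QIOChannel *ioc, GIOCondition condition,
                                 Error **errp)
{
    QIOChannelRDMA *rioc = (QIOChannelRDMA *)ioc;
    struct pollfd pfd = {
        .fd = rioc->fd,
        .events = (condition & G_IO_IN ? POLLIN : 0) |
                  (condition & G_IO_OUT ? POLLOUT : 0),
    };
    int ret;

    do {
        /* rpoll() internally waits on both the cm fd and the verbs fd */
        ret = rpoll(&pfd, 1, -1);
    } while (ret < 0 && errno == EINTR);

    if (ret < 0) {
        error_setg_errno(errp, errno, "rpoll() failed");
        return -1;
    }
    return 0;
}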

There's one other dependent issue here in that I _think_ the migration recv
side is still in a coroutine.. so we may need to move that into a thread
first.  IIRC we don't yet have a major blocker for doing that, but I didn't
check further either.  I've put that issue aside just to see whether this
may or may not make sense.

Thanks,

-- 
Peter Xu




Re: [PATCH] tests/qtest/migrate-test: Add a postcopy memfile test

2024-05-29 Thread Peter Xu
On Wed, May 29, 2024 at 09:54:30AM -0300, Fabiano Rosas wrote:
> Nicholas Piggin  writes:
> 
> > Postcopy requires userfaultfd support, which requires tmpfs if a memory
> > file is used.
> >
> > This adds back support for /dev/shm memory files, but adds preallocation
> > to skip environments where that mount is limited in size.
> >
> > Signed-off-by: Nicholas Piggin 
> > ---
> >
> > How about this? This goes on top of the reset of the patches
> > (I'll re-send them all as a series if we can get to some agreement).
> >
> > This adds back the /dev/shm option with preallocation and adds a test
> > case that requires tmpfs.
> 
> Peter has stronger opinions on this than I do. I'll leave it to him to
> decide.

Sorry if I gave that impression; it's more of a strong willingness to at
some point enable shmem for QEMU migration, rather than wanting to push
back on what Nicholas was trying to do.  Enabling more arches for migration
tests is definitely worthwhile on its own.

Shmem is just a blank spot that IMHO we should start to give better
coverage. E.g. it is the only sane way to boot a VM that is able to do fast
qemu upgrades using ignore-shared - that was true even before Steve's
cpr-exec work - and it is much easier than anonymous memory. And it's also
possible shmem can become (in the next 3-5 years) the 1G page provider that
replaces hugetlb for postcopy's sake - that one is far beyond our current
discussion so I won't expand on it..

IMHO shmem should simply be a major backend just like anonymous, and it is
the only file backend we can realistically test in CI - hugetlb is harder
to manage there.

> 
> Just note that now we're making the CI less deterministic in relation to
> the migration tests. When a test that uses shmem fails, we'll not be
> able to consistently reproduce because the test might not even run
> depending on what has consumed the shmem first.
> 
> Let's also take care that the other consumers of shmem (I think just
> ivshmem-test) are able to cope with the migration-test taking all the
> space, otherwise the CI will still break.

Looks like ivshmem-test only uses a constant 1MB of shmem, so it will
probably succeed if the migration test does, but it's true they face the
same challenge and they interfere with each other..  That test silently
passes (instead of skipping) if mktempshm() fails.  I guess we don't have a
way to solidly test shmem, as shmem simply may not be around.

For this patch alone, personally I'd avoid using "use_uffd_memfile" as the
name, as that's definitely confusing, since shmem can be tested in other
setups too without uffd.  Nicholas, please feel free to move ahead with your
arch enablement series with /tmp if you want to keep the shmem issue
separate.

Thanks,

-- 
Peter Xu




Re: [PATCH] tests/qtest/migrate-test: Use regular file file for shared-memory tests

2024-05-28 Thread Peter Xu
On Wed, May 29, 2024 at 10:05:32AM +1000, Nicholas Piggin wrote:
> I think that's good if you _need_ shm (e.g., for a uffd test), but
> we should permit tests that only require a memory file.

Yes, there's no reason to forbid that; it's just that we're not adding new
tests, and we could potentially change any future use_shmem to not test
shmem anymore.. instead we'd be testing something we don't suggest users
use..

The only concern is a small /dev/shm mount, am I right?  Would it work if
we switched to memory-backend-memfd,share=on?
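
I.e., something along these lines in the test (untested sketch; the flag and
variable names are made up, and the exact option string may need tweaking):

/* sketch only: "use_memfd_memory" is a hypothetical MigrateStart flag */
if (args->use_memfd_memory) {
    memory_backend = g_strdup_printf("-object memory-backend-memfd,id=mem0,"
                                     "size=%s,share=on "
                                     "-numa node,memdev=mem0 ",
                                     memory_size);
}

memfd is still shmem-backed in the kernel, but at least it wouldn't depend
on the size of the /dev/shm mount.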

Thanks,

-- 
Peter Xu




Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr

2024-05-28 Thread Peter Xu
On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
> Preserve fields of RAMBlocks that allocate their host memory during CPR so
> the RAM allocation can be recovered.

This sentence itself did not explain much, IMHO.  QEMU can share memory
using fd based memory already of all kinds, as long as the memory backend
is path-based it can be shared by sharing the same paths to dst.

This reads very confusing as a generic concept.  I mean, QEMU migration
relies on so many things to work right.  We mostly ask the users to "use
exactly the same cmdline for src/dst QEMU unless you know what you're
doing", otherwise many things can break.  That should also include ramblock
being matched between src/dst due to the same cmdlines provided on both
sides.  It'll be confusing to mention this when we thought the ramblocks
also rely on that fact.

So IIUC this sentence should be dropped in the real patch, and I'll try to
guess the real reason with below..

> Mirror the mr->align field in the RAMBlock to simplify the vmstate.
> Preserve the old host address, even though it is immediately discarded,
> as it will be needed in the future for CPR with iommufd.  Preserve
> guest_memfd, even though CPR does not yet support it, to maintain vmstate
> compatibility when it becomes supported.

.. It could be about the vfio vaddr update feature that you mentioned and
only for iommufd (as IIUC vfio still relies on iova ranges, then it won't
help here)?

If so, IMHO we should have this patch (or any variance form) to be there
for your upcoming vfio support.  Keeping this around like this will make
the series harder to review.  Or is it needed even before VFIO?

Another thing to ask: does this idea also need to rely on some future
iommufd kernel support?  If there's anything that's not merged in current
Linux upstream, this series needs to be marked as RFC, so it's not target
for merging.  This will also be true if this patch is "preparing" for that
work.  It means if this patch only services iommufd purpose, even if it
doesn't require any kernel header to be referenced, we should only merge it
together with the full iommufd support comes later (and that'll be after
iommufd kernel supports land).

Thanks,

-- 
Peter Xu




Re: [PATCH V1 17/26] machine: memfd-alloc option

2024-05-28 Thread Peter Xu
Error **errp)
>  {
> +uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;

If there's a machine option to "use memfd for allocations", then it's
shared mem... Hmm..

It is a bit confusing to me in quite a few levels:

  - Why memory allocation method will be defined by a machine property,
even if we have memory-backend-* which should cover everything?

  - Even if we have such a machine property, why setting "memfd" will
always imply shared?  why not private?  After all it's not called
"memfd-shared-alloc", and we can create private mappings using
e.g. memory-backend-memfd,share=off.

>  return memory_region_init_ram_flags_nomigrate(mr, owner, name,
> -  size, 0, errp);
> +  size, flags, errp);
>  }
>  
>  bool memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
> @@ -1713,8 +1714,9 @@ bool memory_region_init_rom_nomigrate(MemoryRegion *mr,
>uint64_t size,
>Error **errp)
>  {
> +uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
>  if (!memory_region_init_ram_flags_nomigrate(mr, owner, name,
> -size, 0, errp)) {
> +size, flags, errp)) {
>   return false;
>  }
>  mr->readonly = true;
> @@ -1731,6 +1733,7 @@ bool 
> memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
>   Error **errp)
>  {
>  Error *err = NULL;
> +uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
>  assert(ops);
>  memory_region_init(mr, owner, name, size);
>  mr->ops = ops;
> @@ -1738,7 +1741,7 @@ bool 
> memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
>  mr->terminates = true;
>  mr->rom_device = true;
>  mr->destructor = memory_region_destructor_ram;
> -mr->ram_block = qemu_ram_alloc(size, 0, mr, &err);
> +mr->ram_block = qemu_ram_alloc(size, flags, mr, &err);
> +mr->ram_block = qemu_ram_alloc(size, flags, mr, );
>  if (err) {
>  mr->size = int128_zero();
>  object_unparent(OBJECT(mr));
> diff --git a/system/physmem.c b/system/physmem.c
> index c736af5..36d97ec 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -45,6 +45,7 @@
>  #include "qemu/qemu-print.h"
>  #include "qemu/log.h"
>  #include "qemu/memalign.h"
> +#include "qemu/memfd.h"
>  #include "exec/memory.h"
>  #include "exec/ioport.h"
>  #include "sysemu/dma.h"
> @@ -1825,6 +1826,19 @@ static void *ram_block_alloc_host(RAMBlock *rb, Error 
> **errp)
>  if (xen_enabled()) {
>  xen_ram_alloc(rb->offset, rb->max_length, mr, errp);
>  
> +} else if (rb->flags & RAM_SHARED) {
> +if (rb->fd == -1) {
> +mr->align = QEMU_VMALLOC_ALIGN;
> +rb->fd = qemu_memfd_create(rb->idstr, rb->max_length + mr->align,
> +   0, 0, 0, errp);
> +}

We used to have cases where RAM_SHARED && rb->fd==-1, I think.

We have some code that explicitly checks rb->fd against -1 to know whether
a block is fd based.  I'm not sure whether there will be implications for
that code.

Maybe it's mostly fine; OTOH I worry more about the whole idea.  I'm not
sure whether this is relevant to "we want to be able to share the mem with
the new process" - if it is, can we simply require the user to use
file-based memory backends, rather than make such a change?

Thanks,

> +if (rb->fd >= 0) {
> +int mfd = rb->fd;
> +qemu_set_cloexec(mfd);
> +host = file_ram_alloc(rb, rb->max_length, mfd, false, 0, errp);
> +trace_qemu_anon_memfd_alloc(rb->idstr, rb->max_length, mfd, 
> host);
> +}
> +
>  } else {
>  host = qemu_anon_ram_alloc(rb->max_length, >align,
> qemu_ram_is_shared(rb),
> @@ -2106,8 +2120,10 @@ RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, 
> ram_addr_t maxsz,
>   void *host),
>       MemoryRegion *mr, Error **errp)
>  {
> +uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
> +flags |= RAM_RESIZEABLE;
>  return qemu_ram_alloc_internal(size, maxsz, resized, NULL,
> -   RAM_RESIZEABLE, mr, errp);
> +   flags, mr, errp);
>  }
>  
>  static void reclaim_ramblock(RAMBlock *block)
> diff --git a/system/trace-events b/system/trace-events
> index f0a80ba..0092734 100644
> --- a/system/trace-events
> +++ b/system/trace-events
> @@ -41,3 +41,4 @@ dirtylimit_vcpu_execute(int cpu_index, int64_t 
> sleep_time_us) "CPU[%d] sleep %"P
>  
>  # physmem.c
>  ram_block_create(const char *name, uint32_t flags, int fd, size_t 
> used_length, size_t max_length, size_t align) "%s, flags %u, fd %d, len %lu, 
> maxlen %lu, align %lu"
> +qemu_anon_memfd_alloc(const char *name, size_t size, int fd, void *ptr) "%s 
> size %zu fd %d -> %p"
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




Re: [PATCH v2 2/6] tests/qtest/migration-test: Fix and enable test_ignore_shared

2024-05-28 Thread Peter Xu
On Tue, May 28, 2024 at 10:42:06AM +1000, Nicholas Piggin wrote:
> This test is already starting to bitrot, so first remove it from ifdef
> and fix compile issues. ppc64 transfers about 2MB, so bump the size
> threshold too.
> 
> It was said to be broken on aarch64 but it may have been the limited shm
> size under gitlab CI. The test is now excluded from running on CI so it
> shouldn't cause too much annoyance.
> 
> So let's try enable it.
> 
> Cc: Yury Kotov 
> Cc: Dr. David Alan Gilbert 

Dave's new email is:

d...@treblig.org

Please feel free to use it in a repost.

Thanks,

> Signed-off-by: Nicholas Piggin 
> ---
>  tests/qtest/migration-test.c | 14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 04bf1c0092..8247ed98f2 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -1893,14 +1893,15 @@ static void 
> test_precopy_unix_tls_x509_override_host(void)
>  #endif /* CONFIG_TASN1 */
>  #endif /* CONFIG_GNUTLS */
>  
> -#if 0
> -/* Currently upset on aarch64 TCG */
>  static void test_ignore_shared(void)
>  {
>  g_autofree char *uri = g_strdup_printf("unix:%s/migsocket", tmpfs);
>  QTestState *from, *to;
> +MigrateStart args = {
> +.use_shmem = true,
> +};
>  
> -if (test_migrate_start(&from, &to, uri, false, true, NULL, NULL)) {
> +if (test_migrate_start(&from, &to, uri, &args)) {
>  return;
>  }
>  
> @@ -1925,11 +1926,11 @@ static void test_ignore_shared(void)
>  wait_for_migration_complete(from);
>  
>  /* Check whether shared RAM has been really skipped */
> -g_assert_cmpint(read_ram_property_int(from, "transferred"), <, 1024 * 
> 1024);
> +g_assert_cmpint(read_ram_property_int(from, "transferred"), <,
> +   4 * 1024 * 1024);
>  
>  test_migrate_end(from, to, true);
>  }
> -#endif
>  
>  static void *
>  test_migrate_xbzrle_start(QTestState *from,
> @@ -3580,7 +3581,8 @@ int main(int argc, char **argv)
>  #endif /* CONFIG_TASN1 */
>  #endif /* CONFIG_GNUTLS */
>  
> -/* migration_test_add("/migration/ignore_shared", test_ignore_shared); */
> +migration_test_add("/migration/ignore_shared", test_ignore_shared);
> +
>  #ifndef _WIN32
>  migration_test_add("/migration/precopy/fd/tcp",
> test_migrate_precopy_fd_socket);
> -- 
> 2.43.0
> 

-- 
Peter Xu




Re: [RFC PATCH 4/4] ci: Add the new migration device tests

2024-05-28 Thread Peter Xu
On Tue, May 28, 2024 at 03:10:48PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Mon, May 27, 2024 at 08:59:00PM -0300, Fabiano Rosas wrote:
> >> Peter Xu  writes:
> >> 
> >> > On Thu, May 23, 2024 at 05:19:22PM -0300, Fabiano Rosas wrote:
> >> >> We have two new migration tests that check cross version
> >> >> compatibility. One uses the vmstate-static-checker.py script to
> >> >> compare the vmstate structures from two different QEMU versions. The
> >> >> other runs a simple migration with a few devices present in the VM, to
> >> >> catch obvious breakages.
> >> >> 
> >> >> Add both tests to the migration-compat-common job.
> >> >> 
> >> >> Signed-off-by: Fabiano Rosas 
> >> >> ---
> >> >>  .gitlab-ci.d/buildtest.yml | 43 +++---
> >> >>  1 file changed, 36 insertions(+), 7 deletions(-)
> >> >> 
> >> >> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
> >> >> index 91c57efded..bc7ac35983 100644
> >> >> --- a/.gitlab-ci.d/buildtest.yml
> >> >> +++ b/.gitlab-ci.d/buildtest.yml
> >> >> @@ -202,18 +202,47 @@ build-previous-qemu:
> >> >>needs:
> >> >>  - job: build-previous-qemu
> >> >>  - job: build-system-opensuse
> >> >> -  # The old QEMU could have bugs unrelated to migration that are
> >> >> -  # already fixed in the current development branch, so this test
> >> >> -  # might fail.
> >> >> +  # This test is allowed to fail because:
> >> >> +  #
> >> >> +  # - The old QEMU could have bugs unrelated to migration that are
> >> >> +  #   already fixed in the current development branch.
> >> >
> >> > Did you ever hit a real failure with this?  I'm wondering whether we can
> >> > remove this allow_failure thing.
> >> >
> >> 
> >> I haven't. But when it fails we'll go through an entire release cycle
> >> with this thing showing red for every person that runs the CI. Remember,
> >> this is a CI failure to which there's no fix aside from waiting for the
> >> release to happen. Even if we're quick to react and disable the job, I
> >> feel it might create some confusion already.
> >
> > My imagination was if needed we'll get complains and we add that until
> > then for that broken release only, and remove in the next release again.
> >
> >> 
> >> >> +  #
> >> >> +  # - The vmstate-static-checker script trips on renames and other
> >> >> +  #   backward-compatible changes to the vmstate structs.
> >> >
> >> > I think I keep my preference per last time we talked on this. :)
> >> 
> >> Sorry, I'm not trying to force this in any way, I just wrote these to
> >> use in the pull-request and thought I'd put it out there. At the very
> >> least we can have your concerns documented. =)
> >
> > Yep that's fine.  I think we should keep such discussion on the list,
> > especially we have different opinions, while none of us got convinced yet
> > so far. :)
> >
> >> 
> >> > I still think it's too early to involve a test that can report false
> >> > negative.
> >> 
> >> (1)
> >> Well, we haven't seen any false negatives, we've seen fields being
> >> renamed. If that happens, then we'll ask the person to update the
> >> script. Is that not acceptable to you? Or are you thinking about other
> >> sorts of issues?
> >
> > Then question is how to update the script. So far it's treated as failure
> > on rename, even if it's benign. Right now we have this:
> >
> > print("Section \"" + sec + "\",", end=' ')
> > print("Description \"" + desc + "\":", end=' ')
> > print("expected field \"" + s_item["field"] + "\",", end=' ')
> > print("got \"" + d_item["field"] + "\"; skipping rest")
> > bump_taint()
> > break
> >
> > Do you want to introduce a list of renamed vmsd fields in this script and
> > maintain that?  IMHO it's an overkill and unnecessary burden to other
> > developers.
> >
> 
> That's not _

Re: [PATCH V1 08/26] migration: vmstate_info_void_ptr

2024-05-28 Thread Peter Xu
On Tue, May 28, 2024 at 11:10:16AM -0400, Steven Sistare wrote:
> On 5/27/2024 2:31 PM, Peter Xu wrote:
> > On Mon, Apr 29, 2024 at 08:55:17AM -0700, Steve Sistare wrote:
> > > Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
> > > can be saved in the migration stream.  This will be needed for CPR.
> > > 
> > > Signed-off-by: Steve Sistare 
> > 
> > This is really tricky.
> > 
> >  From a first glance, I don't think migrating a VA is valid at all for
> > migration even if with exec.. and looks insane to me for a cross-process
> > migration, which seems to be allowed to use as a generic VMSD helper.. as
> > VA is the address space barrier for different processes and I think it
> > normally even apply to generic execve(), and we're trying to jailbreak for
> > some reason..
> > 
> > It definitely won't work for any generic migration as sizeof(void*) can be
> > different afaict between hosts, e.g. 32bit -> 64bit migrations.
> > 
> > Some description would be really helpful in this commit message,
> > e.g. explain the users and why.  Do we need to poison that for generic VMSD
> > use (perhaps with prefixed underscores)?  I think I'll need to read on the
> > rest to tell..
> 
> Short answer: we never dereference the void* in the new process.  And must 
> not.
> 
> Longer answer:
> 
> During CPR for vfio, each mapped DMA region is re-registered in the new
> process using the new VA.  The ioctl to re-register identifies the mapping
> by IOVA and length.
> 
> The same requirement holds for CPR of iommufd devices.  However, in the
> iommufd framework, IOVA does not uniquely identify a dma mapping, and we
> need to use the old VA as the unique identifier.  The new process
> re-registers each mapping, passing the old VA and new VA to the kernel.
> The old VA is never dereferenced in the new process, we just need its value.
> 
> I suspected that the void* which must not be dereferenced might make people
> uncomfortable.  I have an older version of my code which adds a uint64_t
> field to RAMBlock for recording and migrating the old VA.  The saving and
> loading code is slightly less elegant, but no big deal.  Would you prefer
> that?

I see, thanks for explaining.  Yes, that sounds better to me.  Re the
ugliness: is that about a pre_save() plus one extra uint64_t field?  In
that case it looks better compared to migrating a "void*".
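
Just to make sure we mean the same thing, a rough sketch of that (struct
and field names are made up here, this is not the actual patch):

    #include "qemu/osdep.h"
    #include "migration/vmstate.h"

    typedef struct DmaMapState {
        void *host_ptr;        /* runtime-only pointer, never migrated */
        uint64_t host_ptr_val; /* numeric value put on the wire */
    } DmaMapState;

    static int dma_map_pre_save(void *opaque)
    {
        DmaMapState *s = opaque;

        /* Record the old VA as a plain integer; the destination never
         * dereferences it, it is only used as a lookup key. */
        s->host_ptr_val = (uintptr_t)s->host_ptr;
        return 0;
    }

    static const VMStateDescription vmstate_dma_map = {
        .name = "dma-map",
        .version_id = 1,
        .minimum_version_id = 1,
        .pre_save = dma_map_pre_save,
        .fields = (const VMStateField[]) {
            VMSTATE_UINT64(host_ptr_val, DmaMapState),
            VMSTATE_END_OF_LIST()
        },
    };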

I'm trying to read some context on the vaddr remap thing from you, and I
found this:

https://lore.kernel.org/all/y90bvbnrvraceq%2f...@nvidia.com/

So will it work with iommufd now?  Meanwhile, what's the status for mdev?
It looks like it isn't supported yet for either.

Thanks,

-- 
Peter Xu




Re: [PATCH V1 07/26] migration: VMStateId

2024-05-28 Thread Peter Xu
On Tue, May 28, 2024 at 11:10:03AM -0400, Steven Sistare via wrote:
> On 5/27/2024 2:20 PM, Peter Xu wrote:
> > On Mon, Apr 29, 2024 at 08:55:16AM -0700, Steve Sistare wrote:
> > > Define a type for the 256 byte id string to guarantee the same length is
> > > used and enforced everywhere.
> > > 
> > > Signed-off-by: Steve Sistare 
> > > ---
> > >   include/exec/ramblock.h | 3 ++-
> > >   include/migration/vmstate.h | 2 ++
> > >   migration/savevm.c  | 8 
> > >   migration/vmstate.c | 3 ++-
> > >   4 files changed, 10 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> > > index 0babd10..61deefe 100644
> > > --- a/include/exec/ramblock.h
> > > +++ b/include/exec/ramblock.h
> > > @@ -23,6 +23,7 @@
> > >   #include "cpu-common.h"
> > >   #include "qemu/rcu.h"
> > >   #include "exec/ramlist.h"
> > > +#include "migration/vmstate.h"
> > >   struct RAMBlock {
> > >   struct rcu_head rcu;
> > > @@ -35,7 +36,7 @@ struct RAMBlock {
> > >   void (*resized)(const char*, uint64_t length, void *host);
> > >   uint32_t flags;
> > >   /* Protected by the BQL.  */
> > > -char idstr[256];
> > > +VMStateId idstr;
> > >   /* RCU-enabled, writes protected by the ramlist lock */
> > >   QLIST_ENTRY(RAMBlock) next;
> > >   QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
> > 
> > Hmm.. Don't look like a good idea to include a migration header in
> > ramblock.h?  Is this ramblock change needed for this work?
> 
> Well, entities that are migrated include migration headers, and now that
> includes RAMBlock.  There is precedent:
> 
> 0 include/exec/ramblock.h   26 #include "migration/vmstate.h"
> 1 include/hw/acpi/ich9_tco. 14 #include "migration/vmstate.h"
> 2 include/hw/display/ramfb.  4 #include "migration/vmstate.h"
> 3 include/hw/hyperv/vmbus.h 16 #include "migration/vmstate.h"
> 4 include/hw/input/pl050.h  14 #include "migration/vmstate.h"
> 5 include/hw/pci/shpc.h  7 #include "migration/vmstate.h"
> 6 include/hw/virtio/virtio. 20 #include "migration/vmstate.h"
> 7 include/migration/cpu.h8 #include "migration/vmstate.h"
> 
> Granted, only some of the C files that include ramblock.h need all of 
> vmstate.h.
> I could define VMStateId in a smaller file such as migration/misc.h, or a
> new file migration/vmstateid.h, and include that in ramblock.h.
> Any preference?

One issue here is that currently the idstr[] of a ramblock is a verbose
name for the ramblock, and logically it doesn't need to have anything to
do with vmstate.

I'll continue reading to see why we need VMStateId here for a ramblock.
If it is necessary, maybe the name VMStateId is misleading for this use?
It would also be nice to separate the changes, as ramblock.h doesn't
belong to the migration subsystem but to the memory API.
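
E.g. if the only goal is to share the array length, keeping it on the
memory API side would avoid pulling vmstate.h into ramblock.h (names made
up, just to illustrate):

    /* include/exec/ramblock.h -- no migration/ include needed */
    #define RAM_BLOCK_ID_LEN 256

    typedef char RAMBlockIdStr[RAM_BLOCK_ID_LEN];

    /* ramblock.h then keeps "RAMBlockIdStr idstr;", and migration code
     * can reuse RAM_BLOCK_ID_LEN wherever it needs the same length. */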

Thanks,

-- 
Peter Xu




Re: [PATCH V1 00/26] Live update: cpr-exec

2024-05-28 Thread Peter Xu
On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote:
> On 5/27/2024 1:45 PM, Peter Xu wrote:
> > On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
> > > I understand, thanks.  If I can help with any of your todo list,
> > > just ask - steve
> > 
> > Thanks for offering the help, Steve.  Started looking at this today, then I
> > found that I miss something high-level.  Let me ask here, and let me
> > apologize already for starting to throw multiple questions..
> > 
> > IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in
> > this case not host kernel but QEMU-only, and/or upper.
> > 
> > Is there any justification on why the complexity is needed here?  It looks
> > to me this one is more involved than cpr-reboot, so I'm thinking how much
> > we can get from the complexity, and whether it's worthwhile.  1000+ LOC is
> > the min support, and if we even expect more to come, that's really
> > important, IMHO.
> > 
> > For example, what's the major motivation of this whole work?  Is that more
> > on performance, or is it more for supporting the special devices like VFIO
> > which we used to not support, or something else?  I can't find them in
> > whatever cover letter I can find, including this one.
> > 
> > Firstly, regarding performance, IMHO it'll be always nice to share even
> > some very fundamental downtime measurement comparisons using the new exec
> > mode v.s. the old migration ways to upgrade QEMU binary.  Do you perhaps
> > have some number on hand when you started working on this feature years
> > ago?  Or maybe some old links on the list would help too, as I didn't
> > follow this work since the start.
> > 
> > On VFIO, IIUC you started out this project without VFIO migration being
> > there.  Now we have VFIO migration so not sure how much it would work for
> > the upgrade use case. Even with current VFIO migration, we may not want to
> > migrate device states for a local upgrade I suppose, as that can be a lot
> > depending on the type of device assigned.  However it'll be nice to discuss
> > this too if this is the major purpose of the series.
> > 
> > I think one other challenge on QEMU upgrade with VFIO devices is that the
> > dest QEMU won't be able to open the VFIO device when the src QEMU is still
> > using it as the owner.  IIUC this is a similar condition where QEMU wants
> > to have proper ownership transfer of a shared block device, and AFAIR right
> > now we resolved that issue using some form of file lock on the image file.
> > In this case it won't easily apply to a VFIO dev fd, but maybe we still
> > have other approaches, not sure whether you investigated any.  E.g. could
> > the VFIO handle be passed over using unix scm rights?  I think this might
> > remove one dependency of using exec which can cause quite some difference
> > v.s. a generic migration (from which regard, cpr-reboot is still a pretty
> > generic migration).
> > 
> > You also mentioned vhost/tap, is that also a major goal of this series in
> > the follow up patchsets?  Is this a problem only because this solution will
> > do exec?  Can it work if either the exec()ed qemu or dst qemu create the
> > vhost/tap fds when boot?
> > 
> > Meanwhile, could you elaborate a bit on the implication on chardevs?  From
> > what I read in the doc update it looks like a major part of work in the
> > future, but I don't yet understand the issue..  Is it also relevant to the
> > exec() approach?
> > 
> > In all cases, some of such discussion would be really appreciated.  And if
> > you used to consider other approaches to solve this problem it'll be great
> > to mention how you chose this way.  Considering this work contains too many
> > things, it'll be nice if such discussion can start with the fundamentals,
> > e.g. on why exec() is a must.
> 
> The main goal of cpr-exec is providing a fast and reliable way to update
> qemu. cpr-reboot is not fast enough or general enough.  It requires the
> guest to support suspend and resume for all devices, and that takes seconds.
> If one actually reboots the host, that adds more seconds, depending on
> system services.  cpr-exec takes 0.1 secs, and works every time, unlike
> like migration which can fail to converge on a busy system.  Live migration
> also consumes more system and network resources.

Right, but note that when I was thinking of a comparison between cpr-exec
v.s. normal migration, I didn't mean a "normal live migration".  I think
it's more of the case whether exec() can be avoided.  I had a feeling that
this ex

Re: [PATCH] tests/qtest/migrate-test: Use regular file file for shared-memory tests

2024-05-28 Thread Peter Xu
On Tue, May 28, 2024 at 09:35:22AM -0400, Peter Xu wrote:
> On Tue, May 28, 2024 at 02:27:57PM +1000, Nicholas Piggin wrote:
> > There is no need to use /dev/shm for file-backed memory devices, and
> > it is too small to be usable in gitlab CI. Switch to using a regular
> > file in /tmp/ which will usually have more space available.
> > 
> > Signed-off-by: Nicholas Piggin 
> > ---
> > Am I missing something? AFAIKS there is not even any point using
> > /dev/shm aka tmpfs anyway, there is not much special about it as a
> > filesystem. This applies on top of the series just sent, and passes
> > gitlab CI qtests including aarch64.
> 
> I think it's just that /dev/shm guarantees shmem usage, while the var
> "tmpfs" implies g_dir_make_tmp() which may be another non-ram based file
> system, while that'll be slightly different comparing to what a real user
> would use - we don't suggest user to put guest RAM on things like btrfs.
> 
> One real implication is if we add a postcopy test it'll fail with
> g_dir_make_tmp() when it is not pointing to a shmem mount, as
> UFFDIO_REGISTER will fail there.  But that test doesn't yet exist as the
> QEMU paths should be the same even if Linux will trigger different paths
> when different types of mem is used (anonymous v.s. shmem).
> 
> If the goal here is to properly handle the case where tmpfs doesn't have
> enough space, how about what I suggested in the other email?
> 
> https://lore.kernel.org/r/ZlSppKDE6wzjCF--@x1n
> 
> IOW, try populate the shmem region before starting the guest, skip if
> population failed.  Would that work?

Let me append some more info here..

I think madvise() isn't needed, as fallocate() should already do the
population work, AFAIU.  Then it means we pass the shmem path to QEMU,
and QEMU should notice that this memory-backend-file already exists and
open() it directly.

I took a quick walk through the QEMU memory code and so far it all looks
applicable, so QEMU should just start the guest with the pre-populated
shmem page cache.
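
Roughly what I mean on the test side (untested sketch, names made up; the
caller would skip the test when this returns false):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdbool.h>
    #include <sys/types.h>

    static bool prepopulate_shmem_file(const char *path, off_t size)
    {
        int fd = open(path, O_CREAT | O_RDWR, 0600);

        if (fd < 0) {
            return false;
        }
        /* On tmpfs this reserves the pages up front, so running out of
         * space in /dev/shm shows up here rather than inside QEMU. */
        if (fallocate(fd, 0, 0, size) != 0) {
            close(fd);
            unlink(path);
            return false;
        }
        close(fd);
        return true;
    }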

There's one trick where qemu_ram_mmap() will map some extra pages (on x86,
4k), and I don't yet know why we did that..

/*
 * Note: this always allocates at least one extra page of virtual address
 * space, even if size is already aligned.
 */
total = size + align;

But that was only used in mmap_reserve() not mmap_activate(), so the real
mmap() should still be exactly what we fallocate()ed.

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-28 Thread Peter Xu
On Tue, May 28, 2024 at 09:06:04AM +, Gonglei (Arei) wrote:
> Hi Peter,
> 
> > -Original Message-
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Wednesday, May 22, 2024 6:15 AM
> > To: Yu Zhang 
> > Cc: Michael Galaxy ; Jinpu Wang
> > ; Elmar Gerdes ;
> > zhengchuan ; Gonglei (Arei)
> > ; Daniel P. Berrangé ;
> > Markus Armbruster ; Zhijian Li (Fujitsu)
> > ; qemu-devel@nongnu.org; Yuval Shaia
> > ; Kevin Wolf ; Prasanna
> > Kumar Kalever ; Cornelia Huck
> > ; Michael Roth ; Prasanna
> > Kumar Kalever ; Paolo Bonzini
> > ; qemu-bl...@nongnu.org; de...@lists.libvirt.org;
> > Hanna Reitz ; Michael S. Tsirkin ;
> > Thomas Huth ; Eric Blake ; Song
> > Gao ; Marc-André Lureau
> > ; Alex Bennée ;
> > Wainer dos Santos Moschetta ; Beraldo Leal
> > ; Pannengyuan ;
> > Xiexiangyou ; Fabiano Rosas 
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> > 
> > On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> > > Hello Michael and Peter,
> > 
> > Hi,
> > 
> > >
> > > Exactly, not so compelling, as I did it first only on servers widely
> > > used for production in our data center. The network adapters are
> > >
> > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> > > 2-port Gigabit Ethernet PCIe
> > 
> > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more 
> > reasonable.
> > 
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > wvaqk81vxtkzx-l...@mail.gmail.com/
> > 
> > Appreciate a lot for everyone helping on the testings.
> > 
> > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > [ConnectX-5]
> > >
> > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> > > on these two hosts. One is standby while the other is active.
> > >
> > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > network adapters. One of them has:
> > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > >
> > > The comparison between RDMA and TCP on the same NIC could make more
> > sense.
> > 
> > It looks to me NICs are powerful now, but again as I mentioned I don't 
> > think it's
> > a reason we need to deprecate rdma, especially if QEMU's rdma migration has
> > the chance to be refactored using rsocket.
> > 
> > Is there anyone who started looking into that direction?  Would it make 
> > sense
> > we start some PoC now?
> > 
> 
> My team has finished the PoC refactoring which works well. 
> 
> Progress:
> 1.  Implement io/channel-rdma.c,
> 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it is 
> successful,
> 3.  Remove the original code from migration/rdma.c,
> 4.  Rewrite the rdma_start_outgoing_migration and 
> rdma_start_incoming_migration logic,
> 5.  Remove all rdma_xxx functions from migration/ram.c. (to prevent RDMA live 
> migration from polluting the core logic of live migration),
> 6.  The soft-RoCE implemented by software is used to test the RDMA live 
> migration. It's successful.
> 
> We will be submit the patchset later.

That's great news, thank you!

-- 
Peter Xu




Re: [RFC PATCH 2/4] tests/qtest/migration: Add a test that runs vmstate-static-checker

2024-05-28 Thread Peter Xu
On Mon, May 27, 2024 at 07:52:28PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Thu, May 23, 2024 at 05:19:20PM -0300, Fabiano Rosas wrote:
> >> We have the vmstate-static-checker script that takes the output of:
> >> '$QEMU -M $machine -dump-vmstate' for two different QEMU versions and
> >> compares them to check for compatibility breakages. This is just too
> >> simple and useful for us to pass on it. Add a test that runs the
> >> script.
> >> 
> >> Since this needs to use two different QEMU versions, the test is
> >> skipped if only one QEMU is provided. The infrastructure for passing
> >> more than one binary is already in place:
> >> 
> >> $ PYTHON=$(which python3.11) \
> >>  QTEST_QEMU_BINARY_SRC=../build-previous/qemu-system-x86_64 \
> >>  QTEST_QEMU_BINARY=./qemu-system-x86_64 \
> >>  ./tests/qtest/migration-test -p /x86_64/migration/vmstate-checker-script
> >> 
> >> Signed-off-by: Fabiano Rosas 
> >> ---
> >> some code duplication for now, just so we can reason about this
> >> without too much noise
> >> ---
> >>  tests/qtest/migration-test.c | 82 
> >>  1 file changed, 82 insertions(+)
> >> 
> >> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> >> index e8d3555f56..2253e0fc5b 100644
> >> --- a/tests/qtest/migration-test.c
> >> +++ b/tests/qtest/migration-test.c
> >> @@ -63,6 +63,7 @@ static QTestMigrationState dst_state;
> >>  #define DIRTYLIMIT_TOLERANCE_RANGE  25  /* MB/s */
> >>  
> >>  #define ANALYZE_SCRIPT "scripts/analyze-migration.py"
> >> +#define VMSTATE_CHECKER_SCRIPT "scripts/vmstate-static-checker.py"
> >>  
> >>  #define QEMU_VM_FILE_MAGIC 0x5145564d
> >>  #define FILE_TEST_FILENAME "migfile"
> >> @@ -1611,6 +1612,85 @@ static void test_analyze_script(void)
> >>  test_migrate_end(from, to, false);
> >>  cleanup("migfile");
> >>  }
> >> +
> >> +static void test_vmstate_checker_script(void)
> >> +{
> >> +g_autofree gchar *cmd_src = NULL;
> >> +g_autofree gchar *cmd_dst = NULL;
> >> +g_autofree gchar *vmstate_src = NULL;
> >> +g_autofree gchar *vmstate_dst = NULL;
> >> +const char *machine_alias, *machine_opts = "";
> >> +g_autofree char *machine = NULL;
> >> +const char *arch = qtest_get_arch();
> >> +int pid, wstatus;
> >> +const char *python = g_getenv("PYTHON");
> >> +
> >> +if (!getenv(QEMU_ENV_SRC) && !getenv(QEMU_ENV_DST)) {
> >> +g_test_skip("Test needs two different QEMU versions");
> >> +return;
> >> +}
> >> +
> >> +if (!python) {
> >> +g_test_skip("PYTHON variable not set");
> >> +return;
> >> +}
> >> +
> >> +if (strcmp(arch, "i386") == 0 || strcmp(arch, "x86_64") == 0) {
> >> +if (g_str_equal(arch, "i386")) {
> >> +machine_alias = "pc";
> >> +} else {
> >> +machine_alias = "q35";
> >> +}
> >> +} else if (g_str_equal(arch, "s390x")) {
> >> +machine_alias = "s390-ccw-virtio";
> >> +} else if (strcmp(arch, "ppc64") == 0) {
> >> +machine_alias = "pseries";
> >> +} else if (strcmp(arch, "aarch64") == 0) {
> >> +machine_alias = "virt";
> >> +} else {
> >> +g_assert_not_reached();
> >> +}
> >> +
> >> +if (!qtest_has_machine(machine_alias)) {
> >> +g_autofree char *msg = g_strdup_printf("machine %s not 
> >> supported", machine_alias);
> >> +g_test_skip(msg);
> >> +return;
> >> +}
> >> +
> >> +machine = resolve_machine_version(machine_alias, QEMU_ENV_SRC,
> >> +  QEMU_ENV_DST);
> >> +
> >> +vmstate_src = g_strdup_printf("%s/vmstate-src", tmpfs);
> >> +vmstate_dst = g_strdup_printf("%s/vmstate-dst", tmpfs);
> >> +
> >> +cmd_dst = g_strdup_printf("-machine %s,%s -dump-vmstate %s",
> >> +  machine, machine_opts, vmstate_dst);
> >&g

Re: [RFC PATCH 4/4] ci: Add the new migration device tests

2024-05-28 Thread Peter Xu
On Mon, May 27, 2024 at 08:59:00PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Thu, May 23, 2024 at 05:19:22PM -0300, Fabiano Rosas wrote:
> >> We have two new migration tests that check cross version
> >> compatibility. One uses the vmstate-static-checker.py script to
> >> compare the vmstate structures from two different QEMU versions. The
> >> other runs a simple migration with a few devices present in the VM, to
> >> catch obvious breakages.
> >> 
> >> Add both tests to the migration-compat-common job.
> >> 
> >> Signed-off-by: Fabiano Rosas 
> >> ---
> >>  .gitlab-ci.d/buildtest.yml | 43 +++---
> >>  1 file changed, 36 insertions(+), 7 deletions(-)
> >> 
> >> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
> >> index 91c57efded..bc7ac35983 100644
> >> --- a/.gitlab-ci.d/buildtest.yml
> >> +++ b/.gitlab-ci.d/buildtest.yml
> >> @@ -202,18 +202,47 @@ build-previous-qemu:
> >>needs:
> >>  - job: build-previous-qemu
> >>  - job: build-system-opensuse
> >> -  # The old QEMU could have bugs unrelated to migration that are
> >> -  # already fixed in the current development branch, so this test
> >> -  # might fail.
> >> +  # This test is allowed to fail because:
> >> +  #
> >> +  # - The old QEMU could have bugs unrelated to migration that are
> >> +  #   already fixed in the current development branch.
> >
> > Did you ever hit a real failure with this?  I'm wondering whether we can
> > remove this allow_failure thing.
> >
> 
> I haven't. But when it fails we'll go through an entire release cycle
> with this thing showing red for every person that runs the CI. Remember,
> this is a CI failure to which there's no fix aside from waiting for the
> release to happen. Even if we're quick to react and disable the job, I
> feel it might create some confusion already.

My thinking was that, if needed, we'll get complaints and we can add that
for the broken release only, then remove it again in the next release.

> 
> >> +  #
> >> +  # - The vmstate-static-checker script trips on renames and other
> >> +  #   backward-compatible changes to the vmstate structs.
> >
> > I think I keep my preference per last time we talked on this. :)
> 
> Sorry, I'm not trying to force this in any way, I just wrote these to
> use in the pull-request and thought I'd put it out there. At the very
> least we can have your concerns documented. =)

Yep that's fine.  I think we should keep such discussion on the list,
especially when we have different opinions and none of us has been
convinced yet. :)

> 
> > I still think it's too early to involve a test that can report false
> > negative.
> 
> (1)
> Well, we haven't seen any false negatives, we've seen fields being
> renamed. If that happens, then we'll ask the person to update the
> script. Is that not acceptable to you? Or are you thinking about other
> sorts of issues?

Then the question is how to update the script.  So far a rename is treated
as a failure, even when it's benign.  Right now we have this:

print("Section \"" + sec + "\",", end=' ')
print("Description \"" + desc + "\":", end=' ')
print("expected field \"" + s_item["field"] + "\",", end=' ')
print("got \"" + d_item["field"] + "\"; skipping rest")
bump_taint()
break

Do you want to introduce a list of renamed vmsd fields in this script and
maintain that?  IMHO it's overkill and an unnecessary burden on other
developers.

> 
> > I'd still keep running this before soft-freeze like I used to
> > do, throw issues to others and urge them to fix before release.
> 
> Having hidden procedures that maintainers run before a release is bad
> IMHO, it just delays the catching of bugs and frustrates
> contributors. Imagine working on a series, everything goes well with
> reviews, CI passes, patch gets queued and merged and a month later you
> get a ping about something you should have done to avoid breaking
> migration. Right during freeze.

I understand your point; however, I don't yet see a way CI could cover
everything.  CI won't cover performance tests, and I still run the multifd
tests, at least as a smoke test, before soft-freeze.

If something was done wrong, the sooner we notify the better.  Right now
the best trade-off looks like this to me: we notify at soft-freeze, once
per release, considering that's pretty rare too, e.g. 9.0 has n

Re: [RFC PATCH 3/4] tests/qtest/migration: Add support for simple device tests

2024-05-28 Thread Peter Xu
On Mon, May 27, 2024 at 07:59:50PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Thu, May 23, 2024 at 05:19:21PM -0300, Fabiano Rosas wrote:
> >> The current migration-tests are almost entirely focused on catching
> >> bugs on the migration code itself, not on the device migration
> >> infrastructure (vmstate). That means we miss catching some low hanging
> >> fruits that would show up immediately if only we had the device in
> >> question present in the VM.
> >> 
> >> Add a list of devices to include by default in the migration-tests,
> >> starting with one that recently had issues, virtio-gpu. Also add an
> >> environment variable QTEST_DEVICE_OPTS to allow test users to
> >> experiment with different devices or device options.
> >> 
> >> Do not run every migration test with the devices because that would
> >> increase the complexity of the command lines and, as mentioned, the
> >> migration-tests are mostly used to test the core migration code, not
> >> the device migration. Add a special value QTEST_DEVICE_OPTS=all that
> >> enables testing with devices.
> >> 
> >> Notes on usage:
> >> 
> >> For this new testing mode, it's not useful to run all the migration
> >> tests, a single test would probably suffice to catch any issues, so
> >> provide the -p option to migration-test and the test of your choice.
> >> 
> >> Like with the cross-version compatibility tests in CI and the recently
> >> introduced vmstate-static-checker test, to be of any use, a test with
> >> devices needs to be run against a different QEMU version, like so:
> >> 
> >> $ cd build
> >> $ QTEST_DEVICE_OPTS=all \
> >>  QTEST_QEMU_BINARY=./qemu-system-x86_64 \
> >>  QTEST_QEMU_BINARY_DST=../build-previous/qemu-system-x86_64 \
> >>  ./tests/qtest/migration-test -p /x86_64/migration/precopy/tcp/plain
> >> 
> >> $ cd build
> >> $ QTEST_DEVICE_OPTS='-device virtio-net' \
> >>  QTEST_QEMU_BINARY=./qemu-system-x86_64 \
> >>  QTEST_QEMU_BINARY_DST=../build-previous/qemu-system-x86_64 \
> >>  ./tests/qtest/migration-test -p /x86_64/migration/precopy/tcp/plain
> >> 
> >> Signed-off-by: Fabiano Rosas 
> >> ---
> >>  tests/qtest/migration-test.c | 19 +--
> >>  1 file changed, 17 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> >> index 2253e0fc5b..35bb224d18 100644
> >> --- a/tests/qtest/migration-test.c
> >> +++ b/tests/qtest/migration-test.c
> >> @@ -71,6 +71,13 @@ static QTestMigrationState dst_state;
> >>  #define QEMU_ENV_SRC "QTEST_QEMU_BINARY_SRC"
> >>  #define QEMU_ENV_DST "QTEST_QEMU_BINARY_DST"
> >>  
> >> +/*
> >> + * The tests using DEFAULT_DEVICES need a special invocation and
> >> + * cannot be reached from make check, so don't bother with the
> >> + * --without-default-devices build.
> >
> > What's this "--without-default-devices"?
> 
> A configure option. It removes from the build any devices that are
> marked as default. It's an endless source of bugs because it is supposed
> to be paired with a config file that adds back some of the removed
> devices, but there's nothing enforcing that so we always run it as is
> and generate a broken QEMU binary.
> 
> So anything in the tests that refer to devices should first check if
> that QEMU binary even has the device present. I'm saying here that we're
> not going to do that because this test cannot be accidentally reached
> via make check. Realistically, most people will consume this test
> through the CI job only.

Ah, I didn't expect that to be an existing configure option.. then it is
all fine.

Reviewed-by: Peter Xu 

Thanks,

-- 
Peter Xu




Re: [PATCH v6 2/7] migration/multifd: put IOV initialization into compression method

2024-05-28 Thread Peter Xu
On Tue, May 28, 2024 at 01:36:38PM +, Liu, Yuan1 wrote:
> > -Original Message-
> > From: Peter Xu 
> > Sent: Tuesday, May 28, 2024 4:51 AM
> > To: Liu, Yuan1 
> > Cc: faro...@suse.de; qemu-devel@nongnu.org; Zou, Nanhai
> > 
> > Subject: Re: [PATCH v6 2/7] migration/multifd: put IOV initialization into
> > compression method
> > 
> > On Mon, May 06, 2024 at 12:57:46AM +0800, Yuan Liu wrote:
> > > Different compression methods may require different numbers of IOVs.
> > > Based on streaming compression of zlib and zstd, all pages will be
> > > compressed to a data block, so two IOVs are needed for packet header
> > > and compressed data block.
> > >
> > > Signed-off-by: Yuan Liu 
> > > Reviewed-by: Nanhai Zou 
> > > ---
> > >  migration/multifd-zlib.c |  7 +++
> > >  migration/multifd-zstd.c |  8 +++-
> > >  migration/multifd.c  | 22 --
> > >  3 files changed, 26 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
> > > index 737a9645d2..2ced69487e 100644
> > > --- a/migration/multifd-zlib.c
> > > +++ b/migration/multifd-zlib.c
> > > @@ -70,6 +70,10 @@ static int zlib_send_setup(MultiFDSendParams *p,
> > Error **errp)
> > >  goto err_free_zbuff;
> > >  }
> > >  p->compress_data = z;
> > > +
> > > +/* Needs 2 IOVs, one for packet header and one for compressed data
> > */
> > > +p->iov = g_new0(struct iovec, 2);
> > > +
> > >  return 0;
> > >
> > >  err_free_zbuff:
> > > @@ -101,6 +105,9 @@ static void zlib_send_cleanup(MultiFDSendParams *p,
> > Error **errp)
> > >  z->buf = NULL;
> > >  g_free(p->compress_data);
> > >  p->compress_data = NULL;
> > > +
> > > +g_free(p->iov);
> > > +p->iov = NULL;
> > >  }
> > >
> > >  /**
> > > diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
> > > index 256858df0a..ca17b7e310 100644
> > > --- a/migration/multifd-zstd.c
> > > +++ b/migration/multifd-zstd.c
> > > @@ -52,7 +52,6 @@ static int zstd_send_setup(MultiFDSendParams *p, Error
> > **errp)
> > >  struct zstd_data *z = g_new0(struct zstd_data, 1);
> > >  int res;
> > >
> > > -p->compress_data = z;
> > >  z->zcs = ZSTD_createCStream();
> > >  if (!z->zcs) {
> > >  g_free(z);
> > > @@ -77,6 +76,10 @@ static int zstd_send_setup(MultiFDSendParams *p,
> > Error **errp)
> > >  error_setg(errp, "multifd %u: out of memory for zbuff", p->id);
> > >  return -1;
> > >  }
> > > +p->compress_data = z;
> > > +
> > > +/* Needs 2 IOVs, one for packet header and one for compressed data
> > */
> > > +p->iov = g_new0(struct iovec, 2);
> > >  return 0;
> > >  }
> > >
> > > @@ -98,6 +101,9 @@ static void zstd_send_cleanup(MultiFDSendParams *p,
> > Error **errp)
> > >  z->zbuff = NULL;
> > >  g_free(p->compress_data);
> > >  p->compress_data = NULL;
> > > +
> > > +g_free(p->iov);
> > > +p->iov = NULL;
> > >  }
> > >
> > >  /**
> > > diff --git a/migration/multifd.c b/migration/multifd.c
> > > index f317bff077..d82885fdbb 100644
> > > --- a/migration/multifd.c
> > > +++ b/migration/multifd.c
> > > @@ -137,6 +137,13 @@ static int nocomp_send_setup(MultiFDSendParams *p,
> > Error **errp)
> > >  p->write_flags |= QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
> > >  }
> > >
> > > +if (multifd_use_packets()) {
> > > +/* We need one extra place for the packet header */
> > > +p->iov = g_new0(struct iovec, p->page_count + 1);
> > > +} else {
> > > +p->iov = g_new0(struct iovec, p->page_count);
> > > +}
> > > +
> > >  return 0;
> > >  }
> > >
> > > @@ -150,6 +157,8 @@ static int nocomp_send_setup(MultiFDSendParams *p,
> > Error **errp)
> > >   */
> > >  static void nocomp_send_cleanup(MultiFDSendParams *p, Error **errp)
> > >  {
> > > +g_free(p->iov);
> > > +p->iov = NULL;
> > >  return;
> > >  }
> &g

Re: [PATCH] tests/qtest/migrate-test: Use regular file file for shared-memory tests

2024-05-28 Thread Peter Xu
 args->opts_source ? args->opts_source : "",
>   ignore_stderr);
>  if (!args->only_target) {
> @@ -810,7 +795,7 @@ static int test_migrate_start(QTestState **from, 
> QTestState **to,
>   memory_size, tmpfs, uri,
>   arch_opts ? arch_opts : "",
>   arch_target ? arch_target : "",
> - shmem_opts ? shmem_opts : "",
> + memfile_opts ? memfile_opts : "",
>   args->opts_target ? args->opts_target : "",
>   ignore_stderr);
>  *to = qtest_init_with_env(QEMU_ENV_DST, cmd_target);
> @@ -822,8 +807,8 @@ static int test_migrate_start(QTestState **from, 
> QTestState **to,
>   * Remove shmem file immediately to avoid memory leak in test failed 
> case.
>   * It's valid because QEMU has already opened this file
>   */
> -if (args->use_shmem) {
> -unlink(shmem_path);
> +if (args->use_memfile) {
> +unlink(memfile_path);
>  }
>  
>  return 0;
> @@ -1875,7 +1860,7 @@ static void test_ignore_shared(void)
>  g_autofree char *uri = g_strdup_printf("unix:%s/migsocket", tmpfs);
>  QTestState *from, *to;
>  MigrateStart args = {
> -.use_shmem = true,
> +.use_memfile = true,
>  };
>  
>  if (test_migrate_start(&from, &to, uri, &args)) {
> @@ -2033,7 +2018,7 @@ static void test_mode_reboot(void)
>  g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
> FILE_TEST_FILENAME);
>  MigrateCommon args = {
> -.start.use_shmem = true,
> +.start.use_memfile = true,
>  .connect_uri = uri,
>  .listen_uri = "defer",
>  .start_hook = test_mode_reboot_start
> -- 
> 2.43.0
> 

-- 
Peter Xu




Re: [RFC PATCH 4/4] ci: Add the new migration device tests

2024-05-27 Thread Peter Xu
On Thu, May 23, 2024 at 05:19:22PM -0300, Fabiano Rosas wrote:
> We have two new migration tests that check cross version
> compatibility. One uses the vmstate-static-checker.py script to
> compare the vmstate structures from two different QEMU versions. The
> other runs a simple migration with a few devices present in the VM, to
> catch obvious breakages.
> 
> Add both tests to the migration-compat-common job.
> 
> Signed-off-by: Fabiano Rosas 
> ---
>  .gitlab-ci.d/buildtest.yml | 43 +++---
>  1 file changed, 36 insertions(+), 7 deletions(-)
> 
> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
> index 91c57efded..bc7ac35983 100644
> --- a/.gitlab-ci.d/buildtest.yml
> +++ b/.gitlab-ci.d/buildtest.yml
> @@ -202,18 +202,47 @@ build-previous-qemu:
>needs:
>  - job: build-previous-qemu
>  - job: build-system-opensuse
> -  # The old QEMU could have bugs unrelated to migration that are
> -  # already fixed in the current development branch, so this test
> -  # might fail.
> +  # This test is allowed to fail because:
> +  #
> +  # - The old QEMU could have bugs unrelated to migration that are
> +  #   already fixed in the current development branch.

Did you ever hit a real failure with this?  I'm wondering whether we can
remove this allow_failure thing.

> +  #
> +  # - The vmstate-static-checker script trips on renames and other
> +  #   backward-compatible changes to the vmstate structs.

I think I'll keep my preference from the last time we talked about this. :)

I still think it's too early to involve a test that can report false
negatives.  I'd still keep running this before soft-freeze like I used to,
throw the issues at others, and urge them to fix them before the release.
In my previous experience that doesn't consume a lot of my time, and it's
not common to see issues either.

So I want people to really pay attention when a migration CI test fails,
rather than helping people form the habit of "oh, migration CI failed
again?  I think that's fine, it's allowed to fail anyway".

So far I still don't see enough benefit in adding this if we need to pay
for the false-negative issue.  I'll fully support it if e.g. we can fix
the tool to avoid reporting false negatives, but that may take effort that
I haven't looked into.

>allow_failure: true
>variables:
>  IMAGE: opensuse-leap
>  MAKE_CHECK_ARGS: check-build
>script:
> -# Use the migration-tests from the older QEMU tree. This avoids
> -# testing an old QEMU against new features/tests that it is not
> -# compatible with.
> -- cd build-previous
> +- cd build
> +# device state static test: Tests the vmstate structures for
> +# compatibility across QEMU versions. Uses the latest version of
> +# the tests.
> +# old to new
> +- PYTHON=pyvenv/bin/python3
> +  QTEST_QEMU_BINARY_SRC=../build-previous/qemu-system-${TARGET}
> +  QTEST_QEMU_BINARY=./qemu-system-${TARGET}
> +  ./tests/qtest/migration-test -p 
> /${TARGET}/migration/vmstate-checker-script
> +# new to old skipped because vmstate version bumps are always
> +# backward incompatible.
> +
> +# device state runtime test: Performs a cross-version migration
> +# with a select list of devices (see DEFAULT_DEVICES in
> +# migration-test.c). Using the multifd tcp test here, but any will
> +# do.
> +# old to new
> +- QTEST_DEVICE_OPTS=all 
> QTEST_QEMU_BINARY_SRC=../build-previous/qemu-system-${TARGET}
> +  QTEST_QEMU_BINARY=./qemu-system-${TARGET} 
> ./tests/qtest/migration-test
> +  -p /${TARGET}/migration/multifd/tcp/channels/plain/none
> +# new to old
> +- QTEST_DEVICE_OPTS=all 
> QTEST_QEMU_BINARY_DST=../build-previous/qemu-system-${TARGET}
> +  QTEST_QEMU_BINARY=./qemu-system-${TARGET} 
> ./tests/qtest/migration-test
> +  -p /${TARGET}/migration/multifd/tcp/channels/plain/none
> +
> +# migration core tests: Use the migration-tests from the older
> +# QEMU tree. This avoids testing an old QEMU against new
> +# features/tests that it is not compatible with.
> +- cd ../build-previous
>  # old to new
>  - QTEST_QEMU_BINARY_SRC=./qemu-system-${TARGET}
>QTEST_QEMU_BINARY=../build/qemu-system-${TARGET} 
> ./tests/qtest/migration-test
> -- 
> 2.35.3
> 

-- 
Peter Xu




Re: [RFC PATCH 3/4] tests/qtest/migration: Add support for simple device tests

2024-05-27 Thread Peter Xu
On Thu, May 23, 2024 at 05:19:21PM -0300, Fabiano Rosas wrote:
> The current migration-tests are almost entirely focused on catching
> bugs on the migration code itself, not on the device migration
> infrastructure (vmstate). That means we miss catching some low hanging
> fruits that would show up immediately if only we had the device in
> question present in the VM.
> 
> Add a list of devices to include by default in the migration-tests,
> starting with one that recently had issues, virtio-gpu. Also add an
> environment variable QTEST_DEVICE_OPTS to allow test users to
> experiment with different devices or device options.
> 
> Do not run every migration test with the devices because that would
> increase the complexity of the command lines and, as mentioned, the
> migration-tests are mostly used to test the core migration code, not
> the device migration. Add a special value QTEST_DEVICE_OPTS=all that
> enables testing with devices.
> 
> Notes on usage:
> 
> For this new testing mode, it's not useful to run all the migration
> tests, a single test would probably suffice to catch any issues, so
> provide the -p option to migration-test and the test of your choice.
> 
> Like with the cross-version compatibility tests in CI and the recently
> introduced vmstate-static-checker test, to be of any use, a test with
> devices needs to be run against a different QEMU version, like so:
> 
> $ cd build
> $ QTEST_DEVICE_OPTS=all \
>  QTEST_QEMU_BINARY=./qemu-system-x86_64 \
>  QTEST_QEMU_BINARY_DST=../build-previous/qemu-system-x86_64 \
>  ./tests/qtest/migration-test -p /x86_64/migration/precopy/tcp/plain
> 
> $ cd build
> $ QTEST_DEVICE_OPTS='-device virtio-net' \
>  QTEST_QEMU_BINARY=./qemu-system-x86_64 \
>  QTEST_QEMU_BINARY_DST=../build-previous/qemu-system-x86_64 \
>  ./tests/qtest/migration-test -p /x86_64/migration/precopy/tcp/plain
> 
> Signed-off-by: Fabiano Rosas 
> ---
>  tests/qtest/migration-test.c | 19 +--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 2253e0fc5b..35bb224d18 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -71,6 +71,13 @@ static QTestMigrationState dst_state;
>  #define QEMU_ENV_SRC "QTEST_QEMU_BINARY_SRC"
>  #define QEMU_ENV_DST "QTEST_QEMU_BINARY_DST"
>  
> +/*
> + * The tests using DEFAULT_DEVICES need a special invocation and
> + * cannot be reached from make check, so don't bother with the
> + * --without-default-devices build.

What's this "--without-default-devices"?

Other than this it looks all good.. thanks.

-- 
Peter Xu




Re: [RFC PATCH 2/4] tests/qtest/migration: Add a test that runs vmstate-static-checker

2024-05-27 Thread Peter Xu
> +cleanup("vmstate-dst");
> +}

Did I ask before whether this can be written without C?

I think this, and also the analyze-script test, would be more suitable to
write in some other way, e.g. bash or Python, no?

>  #endif
>  
>  static void test_precopy_common(MigrateCommon *args)
> @@ -3495,6 +3575,8 @@ int main(int argc, char **argv)
>  #ifndef _WIN32
>  if (!g_str_equal(arch, "s390x")) {
>  migration_test_add("/migration/analyze-script", test_analyze_script);
> +migration_test_add("/migration/vmstate-checker-script",
> +   test_vmstate_checker_script);
>  }
>  #endif
>  migration_test_add("/migration/precopy/unix/plain",
> -- 
> 2.35.3
> 

-- 
Peter Xu




Re: [PATCH v6 7/7] tests/migration-test: add qpl compression test

2024-05-27 Thread Peter Xu
On Mon, May 06, 2024 at 12:57:51AM +0800, Yuan Liu wrote:
> add qpl to compression method test for multifd migration
> 
> the qpl compression supports software path and hardware
> path(IAA device), and the hardware path is used first by
> default. If the hardware path is unavailable, it will
> automatically fallback to the software path for testing.
> 
> Signed-off-by: Yuan Liu 
> Reviewed-by: Nanhai Zou 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v6 2/7] migration/multifd: put IOV initialization into compression method

2024-05-27 Thread Peter Xu
(MULTIFD_VERSION);
> -
> -/* We need one extra place for the packet header */
> -p->iov = g_new0(struct iovec, page_count + 1);
> -} else {
> -p->iov = g_new0(struct iovec, page_count);
>  }
>  p->name = g_strdup_printf("multifdsend_%d", i);
>  p->page_size = qemu_target_page_size();
> @@ -1353,8 +1358,6 @@ static void 
> multifd_recv_cleanup_channel(MultiFDRecvParams *p)
>      p->packet_len = 0;
>  g_free(p->packet);
>  p->packet = NULL;
> -g_free(p->iov);
> -p->iov = NULL;
>  g_free(p->normal);
>  p->normal = NULL;
>  g_free(p->zero);
> @@ -1602,7 +1605,6 @@ int multifd_recv_setup(Error **errp)
>  p->packet = g_malloc0(p->packet_len);
>  }
>  p->name = g_strdup_printf("multifdrecv_%d", i);
> -p->iov = g_new0(struct iovec, page_count);
>  p->normal = g_new0(ram_addr_t, page_count);
>  p->zero = g_new0(ram_addr_t, page_count);
>  p->page_count = page_count;
> -- 
> 2.39.3
> 

-- 
Peter Xu




Re: [PATCH V1 08/26] migration: vmstate_info_void_ptr

2024-05-27 Thread Peter Xu
On Mon, Apr 29, 2024 at 08:55:17AM -0700, Steve Sistare wrote:
> Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
> can be saved in the migration stream.  This will be needed for CPR.
> 
> Signed-off-by: Steve Sistare 

This is really tricky.

From a first glance, I don't think migrating a VA is valid at all for
migration, even with exec.. and it looks insane to me for a cross-process
migration, which this seems to allow as a generic VMSD helper.. The VA is
the address space barrier between different processes, and I think that
normally applies even to a generic execve(), so we're trying to jailbreak
for some reason..

It definitely won't work for any generic migration, as sizeof(void*) can
be different AFAICT between hosts, e.g. 32bit -> 64bit migrations.

Some description would be really helpful in this commit message,
e.g. explaining the users and why.  Do we need to poison it for generic
VMSD use (perhaps with prefixed underscores)?  I think I'll need to read
the rest to tell..

Thanks,

-- 
Peter Xu




Re: [PATCH V1 07/26] migration: VMStateId

2024-05-27 Thread Peter Xu
On Mon, Apr 29, 2024 at 08:55:16AM -0700, Steve Sistare wrote:
> Define a type for the 256 byte id string to guarantee the same length is
> used and enforced everywhere.
> 
> Signed-off-by: Steve Sistare 
> ---
>  include/exec/ramblock.h | 3 ++-
>  include/migration/vmstate.h | 2 ++
>  migration/savevm.c  | 8 
>  migration/vmstate.c | 3 ++-
>  4 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> index 0babd10..61deefe 100644
> --- a/include/exec/ramblock.h
> +++ b/include/exec/ramblock.h
> @@ -23,6 +23,7 @@
>  #include "cpu-common.h"
>  #include "qemu/rcu.h"
>  #include "exec/ramlist.h"
> +#include "migration/vmstate.h"
>  
>  struct RAMBlock {
>  struct rcu_head rcu;
> @@ -35,7 +36,7 @@ struct RAMBlock {
>  void (*resized)(const char*, uint64_t length, void *host);
>  uint32_t flags;
>  /* Protected by the BQL.  */
> -char idstr[256];
> +VMStateId idstr;
>  /* RCU-enabled, writes protected by the ramlist lock */
>  QLIST_ENTRY(RAMBlock) next;
>  QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;

Hmm.. it doesn't look like a good idea to include a migration header in
ramblock.h?  Is this ramblock change needed for this work?

-- 
Peter Xu




Re: [PATCH V1 05/26] migration: precreate vmstate

2024-05-27 Thread Peter Xu
On Mon, Apr 29, 2024 at 08:55:14AM -0700, Steve Sistare wrote:
> Provide the VMStateDescription precreate field to mark objects that must
> be loaded on the incoming side before devices have been created, because
> they provide properties that will be needed at creation time.  They will
> be saved to and loaded from their own QEMUFile, via
> qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these
> functions are not yet called in this patch.  Allow them to be called
> before or after normal migration is active, when current_migration and
> current_incoming are not valid.
> 
> Signed-off-by: Steve Sistare 
> ---
>  include/migration/vmstate.h |  6 
>  migration/savevm.c  | 69 
> +
>  migration/savevm.h  |  3 ++
>  3 files changed, 73 insertions(+), 5 deletions(-)
> 
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index 294d2d8..4691334 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -198,6 +198,12 @@ struct VMStateDescription {
>   * a QEMU_VM_SECTION_START section.
>   */
>  bool early_setup;
> +
> +/*
> + * Send/receive this object in the precreate migration stream.
> + */
> +bool precreate;
> +
>  int version_id;
>  int minimum_version_id;
>  MigrationPriority priority;
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 9789823..a30bcd9 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -239,6 +239,7 @@ static SaveState savevm_state = {
>  
>  #define SAVEVM_FOREACH(se, entry)\
>  QTAILQ_FOREACH(se, _state.handlers, entry)\
> +if (!se->vmsd || !se->vmsd->precreate)
>  
>  #define SAVEVM_FOREACH_ALL(se, entry)\
>  QTAILQ_FOREACH(se, _state.handlers, entry)
> @@ -1006,13 +1007,19 @@ static void save_section_header(QEMUFile *f, 
> SaveStateEntry *se,
>  }
>  }
>  
> +static bool send_section_footer(SaveStateEntry *se)
> +{
> +return (se->vmsd && se->vmsd->precreate) ||
> +   migrate_get_current()->send_section_footer;
> +}

Does the precreate vmsd "require" the footer, or should it also work
without one?  IMHO it's less than optimal to bind features together
without a good reason.

Thanks,

-- 
Peter Xu




Re: [PATCH V1 03/26] migration: SAVEVM_FOREACH

2024-05-27 Thread Peter Xu
On Mon, May 13, 2024 at 03:27:30PM -0400, Steven Sistare wrote:
> On 5/6/2024 7:17 PM, Fabiano Rosas wrote:
> > Steve Sistare  writes:
> > 
> > > Define an abstraction SAVEVM_FOREACH to loop over all savevm state
> > > handlers, and replace QTAILQ_FOREACH.  Define variants for ALL so
> > > we can loop over all handlers vs a subset of handlers in a subsequent
> > > patch, but at this time there is no distinction between the two.
> > > No functional change.
> > > 
> > > Signed-off-by: Steve Sistare 
> > > ---
> > >   migration/savevm.c | 55 
> > > +++---
> > >   1 file changed, 32 insertions(+), 23 deletions(-)
> > > 
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index 4509482..6829ba3 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -237,6 +237,15 @@ static SaveState savevm_state = {
> > >   .global_section_id = 0,
> > >   };
> > > +#define SAVEVM_FOREACH(se, entry)\
> > > +QTAILQ_FOREACH(se, &savevm_state.handlers, entry)\
> > > +
> > > +#define SAVEVM_FOREACH_ALL(se, entry)\
> > > +QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
> > 
> > This feels worse than SAVEVM_FOREACH_NOT_PRECREATED. We'll have to keep
> > coming back to the definition to figure out which FOREACH is the real
> > deal.
> 
> I take your point, but the majority of the loops do not care about precreated
> objects, so it seems backwards to make them more verbose with
> SAVEVM_FOREACH_NOT_PRECREATE.  I can go either way, but we need
> Peter's opinion also.

I don't have a strong opinion on the name yet; I think it will indeed be
clearer once _ALL() (or whatever other name) is used only with its real
users.

OTOH, besides the name (which is much more trivial..), the precreated idea
in general is a bit scary to me.. especially if it is a workaround for
some new ordering issue caused by the newly added dependencies of the
exec() support.  Maybe I'll understand better once I know the whole design
better.

-- 
Peter Xu




Re: [PATCH V1 00/26] Live update: cpr-exec

2024-05-27 Thread Peter Xu
On Mon, Apr 29, 2024 at 08:55:09AM -0700, Steve Sistare wrote:
> This patch series implements a minimal version of cpr-exec.  Future series
> will add support for:
>   * vfio
>   * chardev's without loss of connectivity
>   * vhost
>   * fine-grained seccomp controls
>   * hostmem-memfd
>   * cpr-exec migration test

Another request besides the questions I already threw out.. could you push
a tree that has everything?  Maybe that'll also help with reviewing the
minimal set.

Thanks,

-- 
Peter Xu




Re: [PATCH V1 04/26] migration: delete unused parameter mis

2024-05-27 Thread Peter Xu
On Mon, Apr 29, 2024 at 08:55:13AM -0700, Steve Sistare wrote:
> Signed-off-by: Steve Sistare 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH V1 00/26] Live update: cpr-exec

2024-05-27 Thread Peter Xu
On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
> I understand, thanks.  If I can help with any of your todo list,
> just ask - steve

Thanks for offering the help, Steve.  I started looking at this today, then
found that I'm missing something high-level.  Let me ask here, and apologize
in advance for throwing multiple questions at once..

IIUC the whole idea of this patchset is to allow an efficient QEMU upgrade,
in this case not of the host kernel but QEMU-only, and/or the layers above.

Is there any justification for why the complexity is needed here?  This one
looks more involved than cpr-reboot, so I'm wondering how much we gain from
the complexity and whether it's worthwhile.  1000+ LOC is the minimal
support, and if we expect even more to come, that's really important, IMHO.

For example, what's the major motivation of this whole work?  Is it more
about performance, or more about supporting special devices like VFIO which
we used not to support, or something else?  I can't find this in any of the
cover letters I could find, including this one.

Firstly, regarding performance, IMHO it'll always be nice to share even
some very basic downtime measurements comparing the new exec mode vs. the
old migration ways of upgrading the QEMU binary.  Do you perhaps have some
numbers on hand from when you started working on this feature years ago?
Or maybe some old links on the list would help too, as I haven't followed
this work since the start.

On VFIO, IIUC you started this project before VFIO migration existed.  Now
we have VFIO migration, so I'm not sure how much it helps the upgrade use
case.  Even with current VFIO migration, we may not want to migrate device
state for a local upgrade, I suppose, as that can be a lot of data
depending on the type of device assigned.  However, it'll be nice to
discuss this too if it is the major purpose of the series.

I think one other challenge on QEMU upgrade with VFIO devices is that the
dest QEMU won't be able to open the VFIO device when the src QEMU is still
using it as the owner.  IIUC this is a similar condition where QEMU wants
to have proper ownership transfer of a shared block device, and AFAIR right
now we resolved that issue using some form of file lock on the image file.
In this case it won't easily apply to a VFIO dev fd, but maybe we still
have other approaches, not sure whether you investigated any.  E.g. could
the VFIO handle be passed over using unix scm rights?  I think this might
remove one dependency of using exec which can cause quite some difference
v.s. a generic migration (from which regard, cpr-reboot is still a pretty
generic migration).

You also mentioned vhost/tap; is that also a major goal of this series in
the follow-up patchsets?  Is this a problem only because this solution will
do exec()?  Can it work if either the exec()ed QEMU or the dst QEMU creates
the vhost/tap fds at boot?

Meanwhile, could you elaborate a bit on the implication on chardevs?  From
what I read in the doc update it looks like a major part of work in the
future, but I don't yet understand the issue..  Is it also relevant to the
exec() approach?

In any case, some such discussion would be really appreciated.  And if you
considered other approaches to solve this problem, it'll be great to
mention why you chose this one.  Considering this work contains so many
things, it'll be nice if such a discussion can start with the fundamentals,
e.g. why exec() is a must.

Thanks,

-- 
Peter Xu




Re: [RFC PATCH 3/3] tests/qtest/migration-test: Enable test_ignore_shared

2024-05-27 Thread Peter Xu
On Mon, May 27, 2024 at 12:11:45PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Mon, May 27, 2024 at 09:42:28AM -0300, Fabiano Rosas wrote:
> >> However, there is an issue here still on all archs - which might very
> >> well have been the original issue - which is the fact that the
> >> containers on the Gitlab CI have limits on shared memory usage.
> >> Unfortunately we cannot enable this test for the CI, so it needs a check
> >> on the GITLAB_CI environment variable.
> >
> > Another option is we teach migration-test to detect whether memory_size of
> > shmem is available, skip if not.  It can be a sequence of:
> >
> >   memfd_create()
> >   fallocate()
> >   ret = madvise(MADV_POPULATE_WRITE)
> >
> > To be run at the entry of migration-test, and skip all use_shmem=true tests
> > if ret != 0, or any step failed above.
> 
> There are actually two issues:
> 
> 1) Trying to run a test that needs more shmem than available in the
> container. This is covered well by your suggestion.
> 
> 2) Trying to use some shmem while another test has already consumed all
> shmem. I'm not sure if this can be done reliably as the tests run in
> parallel.

Maybe we can also make that check per-test: when use_shmem=true, the test
populates the shmem file before using it, and skips if population fails.
And if it succeeded, using that file in that test should be reliable.

-- 
Peter Xu




Re: [RFC PATCH 3/3] tests/qtest/migration-test: Enable test_ignore_shared

2024-05-27 Thread Peter Xu
On Mon, May 27, 2024 at 09:42:28AM -0300, Fabiano Rosas wrote:
> However, there is an issue here still on all archs - which might very
> well have been the original issue - which is the fact that the
> containers on the Gitlab CI have limits on shared memory usage.
> Unfortunately we cannot enable this test for the CI, so it needs a check
> on the GITLAB_CI environment variable.

Another option is to teach migration-test to detect whether the required
amount of shmem is available, and skip if not.  It can be a sequence of:

  memfd_create()
  fallocate()
  ret = madvise(MADV_POPULATE_WRITE)

To be run at the entry of migration-test, and skip all use_shmem=true tests
if ret != 0, or any step failed above.
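
Roughly, such a probe could look like the sketch below (untested; note it
additionally needs an mmap() of the memfd, since madvise() operates on a
mapping, and MADV_POPULATE_WRITE requires a reasonably recent Linux kernel):

  #define _GNU_SOURCE
  #include <fcntl.h>      /* fallocate() */
  #include <stdbool.h>
  #include <sys/mman.h>   /* memfd_create(), mmap(), madvise() */
  #include <unistd.h>     /* close() */

  /* Return true if "size" bytes of shmem can actually be populated. */
  static bool shmem_available(size_t size)
  {
      bool ok = false;
      int fd = memfd_create("migration-shmem-probe", 0);

      if (fd < 0) {
          return false;
      }
      if (fallocate(fd, 0, 0, size) == 0) {
          void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                         fd, 0);

          if (p != MAP_FAILED) {
              /* Fails (e.g. ENOMEM) when the memory can't be populated. */
              ok = madvise(p, size, MADV_POPULATE_WRITE) == 0;
              munmap(p, size);
          }
      }
      close(fd);
      return ok;
  }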

Thanks,

-- 
Peter Xu




Re: [PATCH 5/6] migration: Rephrase message on failure to save / load Xen device state

2024-05-27 Thread Peter Xu
On Mon, May 27, 2024 at 12:53:22PM +0200, Markus Armbruster wrote:
> Peter Xu  writes:
> 
> > On Mon, May 13, 2024 at 04:17:02PM +0200, Markus Armbruster wrote:
> >> Functions that use an Error **errp parameter to return errors should
> >> not also report them to the user, because reporting is the caller's
> >> job.  When the caller does, the error is reported twice.  When it
> >> doesn't (because it recovered from the error), there is no error to
> >> report, i.e. the report is bogus.
> >> 
> >> qmp_xen_save_devices_state() and qmp_xen_load_devices_state() violate
> >> this principle: they call qemu_save_device_state() and
> >> qemu_loadvm_state(), which call error_report_err().
> >> 
> >> I wish I could clean this up now, but migration's error reporting is
> >> too complicated (confused?) for me to mess with it.
> >
> > :-(
> 
> If I understood how it's *supposed* to work, I might have a chance...
> 
> I can see a mixture of reporting errors directly (with error_report() &
> friends), passing them to callers (via Error **errp), and storing them
> in / retrieving them from MigrationState member @error.  This can't be
> right.
> 
> I think a necessary first step towards getting it right is a shared
> understanding how errors are to be handled in migration code.  This
> includes how error data should flow from error source to error sink, and
> what the possible sinks are.

True.  I think the sink should always be MigrationState.error so far.

There's also the additional complexity of detecting errors using either
qemu_file_get_error() or migrate_get_error()..  the error handling in
migration is indeed due for a cleanup.

I've added a cleanup entry to the migration ToDo page:

https://wiki.qemu.org/ToDo/LiveMigration#Migration_error_detection_and_reporting

Thanks,

-- 
Peter Xu




Re: [PATCH 5/6] migration: Rephrase message on failure to save / load Xen device state

2024-05-23 Thread Peter Xu
On Mon, May 13, 2024 at 04:17:02PM +0200, Markus Armbruster wrote:
> Functions that use an Error **errp parameter to return errors should
> not also report them to the user, because reporting is the caller's
> job.  When the caller does, the error is reported twice.  When it
> doesn't (because it recovered from the error), there is no error to
> report, i.e. the report is bogus.
> 
> qmp_xen_save_devices_state() and qmp_xen_load_devices_state() violate
> this principle: they call qemu_save_device_state() and
> qemu_loadvm_state(), which call error_report_err().
> 
> I wish I could clean this up now, but migration's error reporting is
> too complicated (confused?) for me to mess with it.

:-(

> 
> Instead, I'm merely improving the error reported by
> qmp_xen_load_devices_state() and qmp_xen_load_devices_state() to the
> QMP core from
> 
> An IO error has occurred
> 
> to
> saving Xen device state failed
> 
> and
> 
> loading Xen device state failed
> 
> respectively.
> 
> Signed-off-by: Markus Armbruster 

Acked-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH] tests/qtest/migration-test: Run some basic tests on s390x and ppc64 with TCG, too

2024-05-22 Thread Peter Xu
On Wed, May 22, 2024 at 11:12:55AM +0200, Thomas Huth wrote:
> On s390x, we recently had a regression that broke migration / savevm
> (see commit bebe9603fc ("hw/intc/s390_flic: Fix crash that occurs when
> saving the machine state"). The problem was merged without being noticed
> since we currently do not run any migration / savevm related tests on
> x86 hosts.
> While we currently cannot run all migration tests for the s390x target
> on x86 hosts yet (due to some unresolved issues with TCG), we can at
> least run some of the non-live tests to avoid such problems in the future.
> Thus enable the "analyze-script" and the "bad_dest" tests before checking
> for KVM on s390x or ppc64 (this also fixes the problem that the
> "analyze-script" test was not run on s390x at all anymore since it got
> disabled again by accident in a previous refactoring of the code).
> 
> Signed-off-by: Thomas Huth 

Reviewed-by: Peter Xu 

Unrelated to this patch: when I was looking at the bootfile cleanup, it
looks to me like we're leaking bootpath in the test loops..  maybe we need
a bootfile_delete() at the entry of bootfile_create()?

Or even better, prepare the bootfile only once and use it in all tests.
The only tricky arch is x86, which needs to support suspend_me=true, but
maybe we can provide two bootfiles.

-- 
Peter Xu




Re: [PATCH] tests/qtest/migration-test: Fix the check for a successful run of analyze-migration.py

2024-05-22 Thread Peter Xu
On Wed, May 22, 2024 at 11:23:01AM +0200, Thomas Huth wrote:
> If analyze-migration.py cannot be run or crashes, the error is currently
> ignored since the code only checks for nonzero values in case the child
> exited properly. For example, if you run the test with a non-existing
> Python interpreter, it still succeeds:
> 
>  $ PYTHON=wrongpython QTEST_QEMU_BINARY=./qemu-system-x86_64 
> tests/qtest/migration-test
>  ...
>  # Running /x86_64/migration/analyze-script
>  # Using machine type: pc-q35-9.1
>  # starting QEMU: exec ./qemu-system-x86_64 -qtest 
> unix:/tmp/qtest-417639.sock -qtest-log /dev/null -chardev 
> socket,path=/tmp/qtest-417639.qmp,id=char0 -mon chardev=char0,mode=control 
> -display none -audio none -accel kvm -accel tcg -machine pc-q35-9.1, -name 
> source,debug-threads=on -m 150M -serial 
> file:/tmp/migration-test-XPLUN2/src_serial -drive 
> if=none,id=d0,file=/tmp/migration-test-XPLUN2/bootsect,format=raw -device 
> ide-hd,drive=d0,secs=1,cyls=1,heads=1   -uuid 
> ----  -accel qtest
>  # starting QEMU: exec ./qemu-system-x86_64 -qtest 
> unix:/tmp/qtest-417639.sock -qtest-log /dev/null -chardev 
> socket,path=/tmp/qtest-417639.qmp,id=char0 -mon chardev=char0,mode=control 
> -display none -audio none -accel kvm -accel tcg -machine pc-q35-9.1, -name 
> target,debug-threads=on -m 150M -serial 
> file:/tmp/migration-test-XPLUN2/dest_serial -incoming tcp:127.0.0.1:0 -drive 
> if=none,id=d0,file=/tmp/migration-test-XPLUN2/bootsect,format=raw -device 
> ide-hd,drive=d0,secs=1,cyls=1,heads=1 -accel qtest
>  **
>  
> ERROR:../../devel/qemu/tests/qtest/migration-test.c:1603:test_analyze_script: 
> code should not be reached
>  migration-test: ../../devel/qemu/tests/qtest/libqtest.c:240: 
> qtest_wait_qemu: Assertion `pid == s->qemu_pid' failed.
>  migration-test: ../../devel/qemu/tests/qtest/libqtest.c:240: 
> qtest_wait_qemu: Assertion `pid == s->qemu_pid' failed.
>  ok 2 /x86_64/migration/analyze-script
>  ...
> 
> Let's better fail the test in case the child did not exit properly, too.
> 
> Signed-off-by: Thomas Huth 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH] hw/core/machine: move compatibility flags for VirtIO-net USO to machine 8.1

2024-05-21 Thread Peter Xu
On Fri, May 17, 2024 at 09:53:36AM +0200, Fiona Ebner wrote:
> Migration from an 8.2 or 9.0 binary to an 8.1 binary with machine
> version 8.1 can fail with:
> 
> > kvm: Features 0x1c0010130afffa7 unsupported. Allowed features: 0x10179bfffe7
> > kvm: Failed to load virtio-net:virtio
> > kvm: error while loading state for instance 0x0 of device 
> > ':00:12.0/virtio-net'
> > kvm: load of migration failed: Operation not permitted
> 
> The series
> 
> 53da8b5a99 virtio-net: Add support for USO features
> 9da1684954 virtio-net: Add USO flags to vhost support.
> f03e0cf63b tap: Add check for USO features
> 2ab0ec3121 tap: Add USO support to tap device.
> 
> only landed in QEMU 8.2, so the compatibility flags should be part of
> machine version 8.1.
> 
> Moving the flags unfortunately breaks forward migration with machine
> version 8.1 from a binary without this patch to a binary with this
> patch.
> 

I think it'll be nicer if we always copy stable in the commit log, so that
"whether to copy stable" sticks with the patch:

Cc: qemu-stable 

Even though it's not required for qemu stable process (I'd say that's
because we have a good stable tree maintainer..).

So we need this in qemu 8.1, 8.2 and 9.0?

> Fixes: 53da8b5a99 ("virtio-net: Add support for USO features")
> Signed-off-by: Fiona Ebner 

Reviewed-by: Peter Xu 

> ---
>  hw/core/machine.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index c7ceb11501..95051b80db 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -50,15 +50,15 @@ GlobalProperty hw_compat_8_1[] = {
>  { "ramfb", "x-migrate", "off" },
>  { "vfio-pci-nohotplug", "x-ramfb-migrate", "off" },
>  { "igb", "x-pcie-flr-init", "off" },
> +{ TYPE_VIRTIO_NET, "host_uso", "off"},
> +{ TYPE_VIRTIO_NET, "guest_uso4", "off"},
> +{ TYPE_VIRTIO_NET, "guest_uso6", "off"},
>  };
>  const size_t hw_compat_8_1_len = G_N_ELEMENTS(hw_compat_8_1);
>  
>  GlobalProperty hw_compat_8_0[] = {
>  { "migration", "multifd-flush-after-each-section", "on"},
>  { TYPE_PCI_DEVICE, "x-pcie-ari-nextfn-1", "on" },
> -{ TYPE_VIRTIO_NET, "host_uso", "off"},
> -{ TYPE_VIRTIO_NET, "guest_uso4", "off"},
> -{ TYPE_VIRTIO_NET, "guest_uso6", "off"},
>  };
>  const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>  
> -- 
> 2.39.2
> 
> 
> 

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-21 Thread Peter Xu
On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> Hello Michael and Peter,

Hi,

> 
> Exactly, not so compelling, as I did it first only on servers widely
> used for production in our data center. The network adapters are
> 
> Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> 2-port Gigabit Ethernet PCIe

Hmm... I definitely think Jinpu's Mellanox ConnectX-6 looks more
reasonable.

https://lore.kernel.org/qemu-devel/camgffen-dkpmz4ta71mjydyemg0zda15wvaqk81vxtkzx-l...@mail.gmail.com/

Appreciate a lot for everyone helping on the testings.

> InfiniBand controller: Mellanox Technologies MT27800 Family [ConnectX-5]
> 
> which doesn't meet our purpose. I can choose RDMA or TCP for VM
> migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> on these two hosts. One is standby while the other is active.
> 
> Now I'll try on a server with more recent Ethernet and InfiniBand
> network adapters. One of them has:
> BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> 
> The comparison between RDMA and TCP on the same NIC could make more sense.

It looks to me like NICs are powerful now, but again, as I mentioned, I
don't think that is a reason we need to deprecate rdma, especially if
QEMU's rdma migration has a chance to be refactored using rsocket.

Has anyone started looking in that direction?  Would it make sense to start
some PoC now?

Thanks,

-- 
Peter Xu




Re: [PATCH V1 00/26] Live update: cpr-exec

2024-05-20 Thread Peter Xu
I was at a conference and then on PTO until today, so tomorrow will be my
first working day after those.  Sorry Steve, I will try my best to read it
before next week.  I didn't dare to read too much of my inbox yet.  A bit
scared, but I need to face it tomorrow.

On Mon, May 20, 2024, 6:28 p.m. Fabiano Rosas  wrote:

> Steven Sistare  writes:
>
> > Hi Peter, Hi Fabiano,
> >Will you have time to review the migration guts of this series any
> time soon?
> > In particular:
> >
> > [PATCH V1 05/26] migration: precreate vmstate
> > [PATCH V1 06/26] migration: precreate vmstate for exec
> > [PATCH V1 12/26] migration: vmstate factory object
> > [PATCH V1 18/26] migration: cpr-exec-args parameter
> > [PATCH V1 20/26] migration: cpr-exec mode
> >
>
> I'll get to them this week. I'm trying to make some progress with my own
> code before I forget how to program. I'm also trying to find some time
> to implement the device options in the migration tests so we can stop
> these virtio-* breakages that have been popping up.
>
>


Re: [PATCH v4 0/3] Fix "virtio-gpu: fix scanout migration post-load"

2024-05-16 Thread Peter Xu
Looks good here, thanks.

On Thu, May 16, 2024, 2:40 a.m.  wrote:

> From: Marc-André Lureau 
>
> Hi,
>
> The aforementioned patch breaks virtio-gpu device migrations for versions
> pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is
> more
> complex than it may initially appear, as evidenced in the problematic
> commit
> dfcf74fa68c ("virtio-gpu: fix scanout migration post-load").
>
> v2:
>  - use a manual version field test (instead of the more complex struct
> variant)
>
> v3:
>  - introduce machine_check_version()
>  - drop the VMSD version, and use machine version field test
>
> v4:
>  - drop machine_check_version() approach
>  - property renamed to x-scanout-vmstate-version
>
> Marc-André Lureau (3):
>   migration: add "exists" info to load-state-field trace
>   migration: fix a typo
>   virtio-gpu: fix v2 migration
>
>  include/hw/virtio/virtio-gpu.h |  1 +
>  hw/core/machine.c  |  1 +
>  hw/display/virtio-gpu.c| 24 
>  migration/vmstate.c|  7 ---
>  migration/trace-events |  2 +-
>  5 files changed, 23 insertions(+), 12 deletions(-)
>
> --
> 2.41.0.28.gd7d8841f67
>
>


Re: [PATCH v3 5/5] virtio-gpu: fix v2 migration

2024-05-15 Thread Peter Xu
On Wed, May 15, 2024 at 06:15:48PM +0100, Daniel P. Berrangé wrote:
> On Wed, May 15, 2024 at 11:03:27AM -0600, Peter Xu wrote:
> > On Wed, May 15, 2024 at 05:03:44PM +0100, Daniel P. Berrangé wrote:
> > > Above all, I'm failing to see why there's a compelling reason
> > > for virtio_gpu to diverge from our long standing practice of
> > > adding a named property flag "virtio_scanout_vmstate_fix"
> > > on the machine class, and then setting it in machine types
> > > which need it.
> > 
> > The reason to introduce that is definitely avoid introducing fields /
> > properties in similar cases in which case all the fields may represent the
> > same thing ("return true if MC is older than xxx version").  Especially
> > when such change is not bound to a new feature so in which case it won't
> > make sense to allow user to even control that propoerty, even if we
> > exported this "x-virtio-scanout-fix" property, but now we must export it
> > because compat fields require it.
> > 
> > However I think agree that having upstream specific MC versions in VMSD
> > checks is kind of unwanted.  I think the major problem is we don't have
> > that extra machine type abstract where we can have simply a number showing
> > the release of QEMU, then we can map that number to whatever
> > upstream/downstream machine types.  E.g.:
> > 
> >   Release No.   Upstream version   Downstream version
> >   50            9.0                Y.0
> >   51            9.1
> >   52            9.2                Y.1
> >   ...
> 
> Downstream versions do not map cleanly to individual upstream versions
> across the whole code base. If we have two distinct features in upstream
> version X, each of them may map to a different downstream release. 
> 
> This can happen when downstream skips one or more upstream releases.
> One feature from the skipped release might be backported to an earlier
> downstream release, while other feature might not arrive downstream
> until they later rebase. Version based checks are an inherantly
> undesirable idea for a situation where there is any backporting taking
> place, whether its machine type versions or something else. Named feature
> / flag based checks are always the way to go.

I thought this would work better for things like this, where we only want
to fix an ABI breakage and no downstream should special-case such a fix..
but I agree that even with that in mind such cases may be too rare to
bother with the above scheme.  I may have raised a bad idea, I suppose. :-(
Let's stick with the simple way until someone has a better idea.

Thanks,

-- 
Peter Xu




Re: [PATCH v3 5/5] virtio-gpu: fix v2 migration

2024-05-15 Thread Peter Xu
On Wed, May 15, 2024 at 05:03:44PM +0100, Daniel P. Berrangé wrote:
> Above all, I'm failing to see why there's a compelling reason
> for virtio_gpu to diverge from our long standing practice of
> adding a named property flag "virtio_scanout_vmstate_fix"
> on the machine class, and then setting it in machine types
> which need it.

The reason to introduce that is precisely to avoid introducing fields /
properties in similar cases, where all the fields may represent the same
thing ("return true if MC is older than xxx version").  Especially when
such a change is not bound to a new feature, in which case it won't make
sense to let the user control that property at all; even so we'd have to
export this "x-virtio-scanout-fix" property, because compat fields require
it.

However, I think I agree that having upstream-specific MC versions in VMSD
checks is kind of unwanted.  I think the major problem is that we don't
have an extra machine type abstraction where we simply have a number
indicating the QEMU release, which we can then map to whatever
upstream/downstream machine types.  E.g.:

  Release No.   Upstream version   Downstream version
  50            9.0                Y.0
  51            9.1
  52            9.2                Y.1
  ...

Then downstream maps not to 9.0/... but to the release no.  Then here,
instead of hard-coding upstream MC versions, we could provide helpers like:

machine_type_newer_than_50()

Device code can then use it without being polluted by upstream MC
versioning.  Downstream will simply work if the downstream MCs are mapped
correctly to the release no. at rebase time.
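
Purely for illustration (a hypothetical sketch: the "release_no" field does
not exist today, it is only meant to show the shape of the idea):

  /* Assumes MachineClass grows a "release_no" filled in per machine type. */
  static bool machine_type_newer_than(int release_no)
  {
      MachineClass *mc = MACHINE_GET_CLASS(qdev_get_machine());

      return mc->release_no > release_no;
  }

  static bool machine_type_newer_than_50(void)
  {
      return machine_type_newer_than(50);
  }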

But I'm not sure whether it'll even be worthwhile.. in the majority of
cases the VMSD change will still be caused by a new feature, and exporting
that property might be wanted in most of those cases.

In any case, for now I agree it's at least easier to stick with the simple
way.

Thanks,

-- 
Peter Xu




Re: [PATCH v3 5/5] virtio-gpu: fix v2 migration

2024-05-15 Thread Peter Xu
On Wed, May 15, 2024 at 12:02:49PM -0400, Michael S. Tsirkin wrote:
> On Wed, May 15, 2024 at 06:15:56PM +0400, marcandre.lur...@redhat.com wrote:
> > From: Marc-André Lureau 
> > 
> > Commit dfcf74fa ("virtio-gpu: fix scanout migration post-load") broke
> > forward/backward version migration. Versioning of nested VMSD structures
> > is not straightforward, as the wire format doesn't have nested
> > structures versions.
> > 
> > Use the previously introduced check_machine_version() function as a
> > field test to ensure proper saving/loading based on the machine version.
> > The VMSD.version is irrelevant now.
> > 
> > Fixes: dfcf74fa ("virtio-gpu: fix scanout migration post-load")
> > Suggested-by: Peter Xu 
> > Signed-off-by: Marc-André Lureau 
> 
> I don't get it. Our standard way to do it is:
> - add a property (begin name with x- so we don't commit to an API)
> - set from compat machinery
> - test property value in VMSTATE macros
> 
> Big advantage is, it works well with any downstreams
> which pick any properties they like.
> Why is this not a good fit here?

I think it'll simplify upstream to avoid introducing one new field + one
new property for each such protocol change, since they fundamentally
represent the same thing.  But it's indeed a good point that such a helper
can complicate the backport a bit..  I assume a global replacement of the
versions used by the helper will be needed once downstream settles on how
to map downstream MCs to upstream's.

Thanks,

-- 
Peter Xu




Re: [PATCH v3 0/5] Fix "virtio-gpu: fix scanout migration post-load"

2024-05-15 Thread Peter Xu
On Wed, May 15, 2024 at 06:15:51PM +0400, marcandre.lur...@redhat.com wrote:
> From: Marc-André Lureau 
> 
> Hi,
> 
> The aforementioned patch breaks virtio-gpu device migrations for versions
> pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more
> complex than it may initially appear, as evidenced in the problematic commit
> dfcf74fa68c ("virtio-gpu: fix scanout migration post-load").
> 
> v2:
>  - use a manual version field test (instead of the more complex struct 
> variant)
> 
> v3:
>  - introduce machine_check_version()
>  - drop the VMSD version, and use machine version field test

Thanks for trying this out already.

Last time I mentioned this may be for the long term, because I remember Dan
and Thomas were working on some machine deprecation work, and maybe such
things can be folded into that work (and perhaps be easier once that work
has landed, too?).  Just copying them both here so we know where we are
now, as I didn't follow that discussion.  IOW, patch 3/4 may need separate
review from outside migration..

The simpler solution is to stick with the customized field and the simple
fix to the issue first; then whenever we have that new helper later, we
simply use it to replace the old one, and alongside that we can drop the
new field / property too, as long as it is declared with "x-".  That might
be easier to backport too.  Marc-Andre, what do you think?

Thanks,

-- 
Peter Xu




Re: [PATCH v2 3/4] virtio-gpu: add x-vmstate-version

2024-05-14 Thread Peter Xu
On Tue, May 14, 2024 at 11:25:26AM +0400, Marc-André Lureau wrote:
> Hi
> 
> On Tue, May 14, 2024 at 8:35 AM Peter Xu  wrote:
> >
> > Hey, Marc-Andre,
> >
> > On Mon, May 13, 2024 at 11:19:04AM +0400, marcandre.lur...@redhat.com wrote:
> > > diff --git a/hw/display/virtio-gpu.c b/hw/display/virtio-gpu.c
> > > index ae831b6b3e..7f9fb5eacc 100644
> > > --- a/hw/display/virtio-gpu.c
> > > +++ b/hw/display/virtio-gpu.c
> > > @@ -1234,7 +1234,8 @@ static int virtio_gpu_save(QEMUFile *f, void 
> > > *opaque, size_t size,
> > >  }
> > >  qemu_put_be32(f, 0); /* end of list */
> > >
> > > -return vmstate_save_state(f, _virtio_gpu_scanouts, g, NULL);
> > > +return vmstate_save_state_v(f, _virtio_gpu_scanouts, g,
> > > +NULL, g->vmstate_version, NULL);
> > >  }
> > >
> > >  static bool virtio_gpu_load_restore_mapping(VirtIOGPU *g,
> > > @@ -1339,7 +1340,7 @@ static int virtio_gpu_load(QEMUFile *f, void 
> > > *opaque, size_t size,
> > >  }
> > >
> > >  /* load & apply scanout state */
> > > -vmstate_load_state(f, _virtio_gpu_scanouts, g, 1);
> > > +vmstate_load_state(f, _virtio_gpu_scanouts, g, 
> > > g->vmstate_version);
> >
> > [sorry for a late response; attending a conf, and will reply to the v1
> >  thread later for the other discussions..]
> >
> > These two changes shouldn't be needed if we go with the .field_exists()
> > approach, am I right?  IIUC in that case we can keep the version 1 here and
> > don't boost anything, because we relied on the machine versions.
> >
> > IIUC this might be the reason why we found 9.0 mahines are broken on
> > migration.  E.g, IIUC my original patch should work for 9.0<->9.0 too.
> >
> 
> Indeed, but for consistency, shouldn't it use the x-vmstate-version
> value for the top-level VMSD save/load ?
> 
> Otherwise, it feels a bit odd that this x-vmstate-version is only used
> for the nested "virtio-gpu-one-scanout" version.
> 
> Or perhaps we should rename it to x-scanout-vmstate-version ? wdyt

Makes sense to me.  In another place I used to have a field called
preempt_pre_7_2.. which is pretty weird, but I suppose it is more accurate
if the field name reflects how it was defined, especially to differentiate
it from VMSD versioning, as they're indeed confusing when used together.

So if we rename, I suppose we can even use "vmstate-version".  I just wish
VMSD versioning already worked with bi-directional migration, then we'd
save all the fuss... we used to have the chance before introducing
field_exists() (I suppose this one came later), but we didn't care or
notice at that time, sigh.  We'd just need a handshake between src/dst so
that when the src sees the dst uses VMSD v1, it sends v1-only fields even
if it knows about v2.

For the long run we may be able to have some small helper so we fetch the
machine type globally, then maybe we can in the future pass in the test
function as:

bool test_function (void *opaque)
{
  return MACHINE_TYPE_BEFORE(8, 2);
}

Then it'll also avoid even introducing a variable.  Maybe we can provide
this test_function() directly too, one for each release.

Thanks,

> 
> 
> > Thanks,
> >
> > >
> > >  return 0;
> > >  }
> > > @@ -1659,6 +1660,7 @@ static Property virtio_gpu_properties[] = {
> > >  DEFINE_PROP_BIT("blob", VirtIOGPU, parent_obj.conf.flags,
> > >  VIRTIO_GPU_FLAG_BLOB_ENABLED, false),
> > >  DEFINE_PROP_SIZE("hostmem", VirtIOGPU, parent_obj.conf.hostmem, 0),
> > > +DEFINE_PROP_UINT8("x-vmstate-version", VirtIOGPU, vmstate_version, 
> > > 1),
> > >  DEFINE_PROP_END_OF_LIST(),
> > >  };
> > >
> > > --
> > > 2.41.0.28.gd7d8841f67
> > >
> >
> > --
> > Peter Xu
> >
> 

-- 
Peter Xu




Re: [PATCH v2 3/4] virtio-gpu: add x-vmstate-version

2024-05-13 Thread Peter Xu
Hey, Marc-Andre,

On Mon, May 13, 2024 at 11:19:04AM +0400, marcandre.lur...@redhat.com wrote:
> diff --git a/hw/display/virtio-gpu.c b/hw/display/virtio-gpu.c
> index ae831b6b3e..7f9fb5eacc 100644
> --- a/hw/display/virtio-gpu.c
> +++ b/hw/display/virtio-gpu.c
> @@ -1234,7 +1234,8 @@ static int virtio_gpu_save(QEMUFile *f, void *opaque, 
> size_t size,
>  }
>  qemu_put_be32(f, 0); /* end of list */
>  
> -return vmstate_save_state(f, _virtio_gpu_scanouts, g, NULL);
> +return vmstate_save_state_v(f, _virtio_gpu_scanouts, g,
> +NULL, g->vmstate_version, NULL);
>  }
>  
>  static bool virtio_gpu_load_restore_mapping(VirtIOGPU *g,
> @@ -1339,7 +1340,7 @@ static int virtio_gpu_load(QEMUFile *f, void *opaque, 
> size_t size,
>  }
>  
>  /* load & apply scanout state */
> -vmstate_load_state(f, _virtio_gpu_scanouts, g, 1);
> +vmstate_load_state(f, _virtio_gpu_scanouts, g, 
> g->vmstate_version);

[sorry for a late response; attending a conf, and will reply to the v1
 thread later for the other discussions..]

These two changes shouldn't be needed if we go with the .field_exists()
approach, am I right?  IIUC in that case we can keep version 1 here and
not bump anything, because we rely on the machine versions.

IIUC this might be the reason why we found 9.0 machines are broken on
migration.  E.g., IIUC my original patch should work for 9.0<->9.0 too.

Thanks,

>  
>  return 0;
>  }
> @@ -1659,6 +1660,7 @@ static Property virtio_gpu_properties[] = {
>  DEFINE_PROP_BIT("blob", VirtIOGPU, parent_obj.conf.flags,
>  VIRTIO_GPU_FLAG_BLOB_ENABLED, false),
>  DEFINE_PROP_SIZE("hostmem", VirtIOGPU, parent_obj.conf.hostmem, 0),
> +DEFINE_PROP_UINT8("x-vmstate-version", VirtIOGPU, vmstate_version, 1),
>  DEFINE_PROP_END_OF_LIST(),
>  };
>  
> -- 
> 2.41.0.28.gd7d8841f67
> 

-- 
Peter Xu




Re: [PATCH v2 2/4] migration: fix a typo

2024-05-13 Thread Peter Xu
On Mon, May 13, 2024 at 11:19:03AM +0400, marcandre.lur...@redhat.com wrote:
> From: Marc-André Lureau 
> 
> Signed-off-by: Marc-André Lureau 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH 3/4] virtio-gpu: use a VMState variant for the scanout field

2024-05-10 Thread Peter Xu
Hi, Marc-André,

On Fri, May 10, 2024 at 12:39:34PM +0400, Marc-André Lureau wrote:
> Since we don't have per VMSD version information on the wire, nested
> struct versioning is quite limited and cumbersome. I am not sure it
> can be changed without breaking the stream format, and whether it's
> worthwhile.

Right, that's a major pain, and actually I just noticed it..

I think it'll be much, much simpler if we keep the vmsd version on the wire
for each VMSD (including struct fields); then it makes more sense to me.

Then when I went back and looked at the VSTRUCT thing again...  I can
hardly understand what it is doing, or how it works at all.

Look at the only current IPMI user, which has:

VMSTATE_VSTRUCT_V(kcs, ISAIPMIKCSDevice, 2, vmstate_IPMIKCS,
  IPMIKCS, 2),

It sets both the vmsd version and struct_version to 2.  I can't tell why it
matters, then, if both of the fields are the same anyway..

When we do save(), there is:

} else if (field->flags & VMS_STRUCT) {
ret = vmstate_save_state(f, field->vmsd, curr_elem,
 vmdesc_loop);
} else if (field->flags & VMS_VSTRUCT) {
ret = vmstate_save_state_v(f, field->vmsd, curr_elem,
   vmdesc_loop,
   field->struct_version_id, errp);

When we load():

} else if (field->flags & VMS_STRUCT) {
ret = vmstate_load_state(f, field->vmsd, curr_elem,
 field->vmsd->version_id);
} else if (field->flags & VMS_VSTRUCT) {
ret = vmstate_load_state(f, field->vmsd, curr_elem,
 field->struct_version_id);
} else {

In this case, passing in struct_version==version should have zero effect
afaict, because the default behavior is to pass in vmsd->version_id anyway.

Moreover, now I highly doubt whether the whole VMS_STRUCT scheme makes
sense at all, as you mentioned.  Especially on the load side: here we
should rely on vmstate_load_state() taking the last parameter as the
version_id on the wire.  Instead we're passing in the struct's version_id
or struct_version_id, and neither of them makes sense to me... if we miss
that version_id information, afaiu we should simply fix it and put it on
the wire..  It'll break migration, and we may need to work that out, but I
don't see a better way.  Keeping it like this looks like a nightmare to
me.. :-(

Regardless of all this mess..  For this specific problem, what I meant is
exactly what Michael was requesting too (hopefully): I'd want to avoid
further extending the complexity in this area.  I have a patch attached at
the end, which I also tested with 8.2<->9.0 bi-directional migrations, and
it worked for me when I smoke-tested it.  Please have a look to see whether
that makes sense while avoiding most of the tricks.

I'd also like to mention one more thing, just in case it can draw some more
attention from the virtio folks..

Normally I run vmstate-static-checker.py before soft freeze, and I did it
for 9.0 too without seeing this problem.  It wasn't flagged because all
virtio devices use the "self managed" VMSTATE_VIRTIO_DEVICE to migrate.  In
that case I am out of luck.  We can further extend what Fabiano mentioned
in the other thread to cover migration stream validation in the future, but
IMHO that needs extra work, and will most likely work the same as the
vmstate static checker while wasting many more CPU resources.  It would be
good if someone could still help move virtio towards what most of the other
devices do, or at least get it covered by the static checker, too.  But
that is definitely a separate topic.. so we can address the immediate
breakage first.

Thanks,

==8<==
>From a24ef99670fa7102da461d795aed4a957bad86b1 Mon Sep 17 00:00:00 2001
From: Peter Xu 
Date: Fri, 10 May 2024 12:33:34 -0400
Subject: [PATCH] fix gpu

Signed-off-by: Peter Xu 
---
 include/hw/virtio/virtio-gpu.h |  2 +-
 hw/core/machine.c  |  1 +
 hw/display/virtio-gpu.c| 21 +++--
 3 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/hw/virtio/virtio-gpu.h b/include/hw/virtio/virtio-gpu.h
index ed44cdad6b..e128501bdc 100644
--- a/include/hw/virtio/virtio-gpu.h
+++ b/include/hw/virtio/virtio-gpu.h
@@ -176,7 +176,7 @@ typedef struct VGPUDMABuf {
 
 struct VirtIOGPU {
 VirtIOGPUBase parent_obj;
-
+uint8_t vmstate_version;
 uint64_t conf_max_hostmem;
 
 VirtQueue *ctrl_vq;
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 4ff60911e7..8f6f0dda7c 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -42,6 +42,7 @@ GlobalProperty hw_compat_8_2[] = {
 { "migration", "zero-page-detection", "legacy&q

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-09 Thread Peter Xu
On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
> That's a good news to see the socket abstraction for RDMA!
> When I was developed the series above, the most pain is the RDMA migration 
> has no QIOChannel abstraction and i need to take a 'fake channel'
> for it which is awkward in code implementation.
> So, as far as I know, we can do this by
> i. the first thing is that we need to evaluate the rsocket is good enough to 
> satisfy our QIOChannel fundamental abstraction
> ii. if it works right, then we will continue to see if it can give us 
> opportunity to hide the detail of rdma protocol
> into rsocket by remove most of code in rdma.c and also some hack in 
> migration main process.
> iii. implement the advanced features like multi-fd and multi-uri for rdma 
> migration.
> 
> Since I am not familiar with rsocket, I need some times to look at it and do 
> some quick verify with rdma migration based on rsocket.
> But, yes, I am willing to involved in this refactor work and to see if we can 
> make this migration feature more better:)

Based on what we have now, it looks like we'd better pause the deprecation
process a bit, so I think we shouldn't need to rush it, at least in 9.1,
and we'll need to see how the refactoring goes.

It'll be perfect if rsocket works; otherwise, supporting multifd with
little overhead / few exported APIs would also be a good thing in general
with whatever approach.  And obviously all of this depends on being able to
get resources from companies to support this feature in the first place.

Note that so far nobody has compared RDMA vs. NIC performance, so if any of
us can provide some test results, please do so.  Many people say RDMA is
better, but I haven't yet seen any numbers comparing it with modern TCP
networks.  I don't want old impressions floating around even if things
might have changed..  When we have consolidated results, we should share
them and also reflect them in QEMU's migration docs once an rdma
documentation page is ready.

Chuan, please check the whole thread discussion, it may help to understand
what we are looking for on rdma migrations [1].  Meanwhile please feel free
to sync with Jinpu's team and see how to move forward with such a project.

[1] https://lore.kernel.org/qemu-devel/87frwatp7n@suse.de/

Thanks,

-- 
Peter Xu




Re: [PATCH 3/3] migration/colo: Tidy up bql_unlock() around bdrv_activate_all()

2024-05-09 Thread Peter Xu
On Thu, May 09, 2024 at 11:31:06AM +0800, Li Zhijian via wrote:
> Make the code more tight.
> 
> Cc: Michael Tokarev 
> Signed-off-by: Li Zhijian 

Reviewed-by: Peter Xu 

> ---
> This change/comment suggested by "Michael Tokarev " came
> a bit late at that time, let's update it together in these minor set
> this time.

You can use a tag next time:

Suggested-by: Michael Tokarev 

> ---
>  migration/colo.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/migration/colo.c b/migration/colo.c
> index 991806c06a..1b6d9da1c8 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -838,12 +838,11 @@ static void *colo_process_incoming_thread(void *opaque)
>  /* Make sure all file formats throw away their mutable metadata */
>  bql_lock();
>  bdrv_activate_all(_err);
> +bql_unlock();
>  if (local_err) {
> -bql_unlock();
>  error_report_err(local_err);
>  return NULL;
>  }
> -bql_unlock();
>  
>  failover_init_state();
>  
> -- 
> 2.31.1
> 
> 

-- 
Peter Xu




Re: [PATCH 2/3] migration/colo: make colo_incoming_co() return void

2024-05-09 Thread Peter Xu
On Thu, May 09, 2024 at 11:31:05AM +0800, Li Zhijian via wrote:
> Currently, it always returns 0, no need to check the return value at all.
> In addition, enter colo coroutine only if migration_incoming_colo_enabled()
> is true.
> Once the destination side enters the COLO* state, the COLO process will
> take over the remaining processes until COLO exits.
> 
> Signed-off-by: Li Zhijian 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH 1/3] migration/colo: Minor fix for colo error message

2024-05-09 Thread Peter Xu
On Thu, May 09, 2024 at 11:31:04AM +0800, Li Zhijian wrote:
> - Explicitly show the missing module name: replication
> - Fix capability name to x-colo
> 
> Signed-off-by: Li Zhijian 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2] target/loongarch/kvm: Fix VM recovery from disk failures

2024-05-09 Thread Peter Xu
On Wed, May 08, 2024 at 10:47:32AM +0800, Song Gao wrote:
> vmstate does not save kvm_state_conter,
> which can cause VM recovery from disk to fail.
> 
> Signed-off-by: Song Gao 

Acked-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH 8/9] migration: Add support for fdset with multifd + file

2024-05-08 Thread Peter Xu
c void file_remove_fdset(void)
> >  }
> >  }
> >  
> > +/*
> > + * With multifd, due to the behavior of the dup() system call, we need
> > + * the fdset to have two non-duplicate fds so we can enable direct IO
> > + * in the secondary channels without affecting the main channel.
> > + */
> >  static bool file_parse_fdset(const char *filename, int64_t *fdset_id,
> >   Error **errp)
> >  {
> > +FdsetInfoList *fds_info;
> > +FdsetFdInfoList *fd_info;
> >  const char *fdset_id_str;
> > +int nfds = 0;
> >  
> >  *fdset_id = -1;
> >  
> > @@ -71,6 +80,32 @@ static bool file_parse_fdset(const char *filename, 
> > int64_t *fdset_id,
> >  return false;
> >  }
> >  
> > +if (!migrate_multifd() || !migrate_direct_io()) {
> > +return true;
> > +}
> > +
> > +for (fds_info = qmp_query_fdsets(NULL); fds_info;
> > + fds_info = fds_info->next) {
> > +
> > +if (*fdset_id != fds_info->value->fdset_id) {
> > +continue;
> > +}
> > +
> > +for (fd_info = fds_info->value->fds; fd_info; fd_info = 
> > fd_info->next) {
> > +if (nfds++ > 2) {
> > +break;
> > +}
> > +}
> > +}
> > +
> > +if (nfds != 2) {
> > +error_setg(errp, "Outgoing migration needs two fds in the fdset, "
> > +   "got %d", nfds);
> > +qmp_remove_fd(*fdset_id, false, -1, NULL);
> > +*fdset_id = -1;
> > +return false;
> > +}
> > +
> >  return true;
> >  }
> 
> Related to my thoughts in an earlier patch, where I say that use of fdsets
> ought to be transparent to QEMU code, I'm not a fan of having this logic
> in migration code.
> 
> IIUC, the migration code will call  qio_channel_file_new_path twice,
> once with O_DIRECT and once without. This should trigger two calls
> into monitor_fdset_dup_fd_add with different flags. If we're matching
> flags in that monitor_fdset_dup_fd_add(), then if only 1 FD was
> provided, are we not able to report an error there ?

Right, this sounds workable.

For a real sanity check, we may want to somehow verify that the two fds
returned from qio_channel_file_new_path() point to the same file
underneath.

What was mentioned in the other thread (kcmp with KCMP_FILE) might not
work, as the whole purpose of having two fds is to make sure they have
different struct file backing them (and only one of them has O_DIRECT).
fstat() might work here via the st_ino field, etc.; maybe fstatfs() too,
but perhaps that's overly cautious.  It's just a pain to use two fds in the
first place..
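
A minimal sketch of such a check (untested; it assumes both fds are already
open and simply compares device and inode numbers):

  #include <stdbool.h>
  #include <sys/stat.h>

  /* Return true if fd1 and fd2 refer to the same underlying file. */
  static bool fds_same_file(int fd1, int fd2)
  {
      struct stat st1, st2;

      if (fstat(fd1, &st1) < 0 || fstat(fd2, &st2) < 0) {
          return false;
      }
      return st1.st_dev == st2.st_dev && st1.st_ino == st2.st_ino;
  }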

Thanks,

-- 
Peter Xu




Re: [PATCH 2/9] migration: Fix file migration with fdset

2024-05-08 Thread Peter Xu
On Wed, May 08, 2024 at 09:02:16AM +0100, Daniel P. Berrangé wrote:
> On Fri, May 03, 2024 at 12:23:51PM -0400, Peter Xu wrote:
> > On Fri, Apr 26, 2024 at 11:20:35AM -0300, Fabiano Rosas wrote:
> > > When the migration using the "file:" URI was implemented, I don't
> > > think any of us noticed that if you pass in a file name with the
> > > format "/dev/fdset/N", this allows a file descriptor to be passed in
> > > to QEMU and that behaves just like the "fd:" URI. So the "file:"
> > > support has been added without regard for the fdset part and we got
> > > some things wrong.
> > > 
> > > The first issue is that we should not truncate the migration file if
> > > we're allowing an fd + offset. We need to leave the file contents
> > > untouched.
> > 
> > I'm wondering whether we can use fallocate() instead on the ranges so that
> > we always don't open() with O_TRUNC.  Before that..  could you remind me
> > why do we need to truncate in the first place?  I definitely missed
> > something else here too.
> 
> You're mixing distinct concepts here. fallocate makes a file region
> non-sparse, while O_TRUNC removes all existing allocation, making it
> sparse if we write at non-contiguous offsets. I don't think we would
> want to call fallocate, since we /want/ a sparse file so that we
> don't needlessly store large regions of all-zeros as RAM maps.

I meant fallocate() with FALLOC_FL_PUNCH_HOLE.  But now I think it'll be
good if we avoid both.

Thanks,

-- 
Peter Xu




Re: [PATCH 0/4] Fix "virtio-gpu: fix scanout migration post-load"

2024-05-07 Thread Peter Xu
On Tue, May 07, 2024 at 05:46:36PM -0300, Fabiano Rosas wrote:
> marcandre.lur...@redhat.com writes:
> 
> > From: Marc-André Lureau 
> >
> > Hi,
> >
> > The aforementioned patch breaks virtio-gpu device migrations for versions
> > pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more
> > complex than it may initially appear, as evidenced in the problematic commit
> > dfcf74fa68c ("virtio-gpu: fix scanout migration post-load").
> >
> > To resolve this, we need to propagate the `vmstate` `version_id` through the
> > nested structures. Additionally, we should tie specific machine version to a
> > corresponding `version_id` to maintain migration compatibility.
> >
> > `VMS_VSTRUCT` allows specifying the appropriate version of the nested 
> > structure
> > to use.
> 
> This would have been caught by the migration-compat-x86_64 CI job had we
> added the virtio-gpu device to it.

I had exactly the same thoughts, and I actually added a todo after I
noticed this issue but forgot to discuss it with you today..  I actually
have one more, on whether we can allow vmsd versioning to work
bi-directionally (then we can get rid of the machine type dependencies); we
may need the handshake work as a prerequisite, so that the two QEMUs talk
first to verify the vmsds.  But let's leave that for later.

> 
> $ cd build-8.2
> $ QTEST_TRACE='vmstate_*' QTEST_DEVICE_OPTS='-device virtio-gpu' \
> QTEST_QEMU_BINARY=./qemu-system-x86_64 \
> QTEST_QEMU_BINARY_DST=../build-9.0/qemu-system-x86_64 
> ./tests/qtest/migration-test
> ...
> vmstate_n_elems fb.offset: 1
> vmstate_subsection_load virtio-gpu-one-scanout
> vmstate_subsection_load_good virtio-gpu-one-scanout
> vmstate_load_state_end virtio-gpu-one-scanout end/0
> vmstate_subsection_load virtio-gpu-scanouts
> vmstate_subsection_load_good virtio-gpu-scanouts
> vmstate_load_state_end virtio-gpu-scanouts end/0
> vmstate_subsection_load virtio-gpu
> vmstate_subsection_load_good virtio-gpu
> vmstate_load_state_end virtio-gpu end/0
> vmstate_downtime_load type=non-iterable idstr=:00:03.0/virtio-gpu 
> instance_id=0 downtime=32118
> qemu-system-x86_64: Missing section footer for :00:03.0/virtio-gpu
> vmstate_downtime_checkpoint dst-precopy-loadvm-completed
> qemu-system-x86_64: load of migration failed: Invalid argument
> 
> Some considerations:
> 
> 1) Here QTEST_DEVICE_OPTS is a hack I added on top, it doesn't currently
>exist.
> 
> 2) This only uncovers relatively simple bugs where we don't need the
>guest to access the device, it just needs to be there.

Right, but having something to cover the basics would still be nice; the
rarer the bug, the less impact on users too (I hope!), so those can also be
treated with lower priority when testing.

> 
> We could take the steps to enable this kind of testing if we think it's
> worthwhile. Some downsides are:
> 
> a) the item (2) above - situations that depend on guest behavior are out
>of the picture because migration-test runs only a custom program that
>dirties memory;
> 
> b) this test only works in CI or in a pre setup environment because it
>needs the previous QEMU version to be built beforehand;
> 
> c) the full set of migration tests already runs a few times in CI via
>make check, plus the compat job. We'll probably need to do some
>simplification to avoid taking too much additional time;
> 
> d) there's also the obvious maintenance burden of choosing devices and
>doing the eventual upkeep of the QEMU command line for the
>migration-test.

The last one should be mostly fine, iiuc - we shouldn't add any device that
can easily break the ABI, and the list of devices can start minimal; an
extreme case is that we add a device only after something with a supposedly
stable ABI has broken first.

Then, if the ABI is stable, we need to go through the deprecation
procedure, and that takes two releases.  We can drop the device from
migration-test in the release where the deprecation starts.

Thanks,

-- 
Peter Xu




Re: [PATCH v10 0/7] Support message-based DMA in vfio-user server

2024-05-07 Thread Peter Xu
On Tue, May 07, 2024 at 07:34:24AM -0700, Mattias Nissler wrote:
> This series adds basic support for message-based DMA in qemu's vfio-user
> server. This is useful for cases where the client does not provide file
> descriptors for accessing system memory via memory mappings. My motivating use
> case is to hook up device models as PCIe endpoints to a hardware design. This
> works by bridging the PCIe transaction layer to vfio-user, and the endpoint
> does not access memory directly, but sends memory requests TLPs to the 
> hardware
> design in order to perform DMA.
> 
> Note that more work is needed to make message-based DMA work well: qemu
> currently breaks down DMA accesses into chunks of size 8 bytes at maximum, 
> each
> of which will be handled in a separate vfio-user DMA request message. This is
> quite terrible for large DMA accesses, such as when nvme reads and writes
> page-sized blocks for example. Thus, I would like to improve qemu to be able 
> to
> perform larger accesses, at least for indirect memory regions. I have 
> something
> working locally, but since this will likely result in more involved surgery 
> and
> discussion, I am leaving this to be addressed in a separate patch.

I assume Jag will pick this up then.

Thanks,

-- 
Peter Xu




Re: [PATCH v10 1/7] system/physmem: Replace qemu_mutex_lock() calls with QEMU_LOCK_GUARD

2024-05-07 Thread Peter Xu
On Tue, May 07, 2024 at 07:34:25AM -0700, Mattias Nissler wrote:
> From: Philippe Mathieu-Daudé 
> 
> From: Philippe Mathieu-Daudé 
> 
> Simplify cpu_[un]register_map_client() and cpu_notify_map_clients()
> by replacing the pair of qemu_mutex_lock/qemu_mutex_unlock calls by
> the WITH_QEMU_LOCK_GUARD() macro.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> Signed-off-by: Mattias Nissler 
> Reviewed-by: Mattias Nissler 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v10 4/7] softmmu: Support concurrent bounce buffers

2024-05-07 Thread Peter Xu
On Tue, May 07, 2024 at 07:34:28AM -0700, Mattias Nissler wrote:
> When DMA memory can't be directly accessed, as is the case when
> running the device model in a separate process without shareable DMA
> file descriptors, bounce buffering is used.
> 
> It is not uncommon for device models to request mapping of several DMA
> regions at the same time. Examples include:
>  * net devices, e.g. when transmitting a packet that is split across
>several TX descriptors (observed with igb)
>  * USB host controllers, when handling a packet with multiple data TRBs
>(observed with xhci)
> 
> Previously, qemu only provided a single bounce buffer per AddressSpace
> and would fail DMA map requests while the buffer was already in use. In
> turn, this would cause DMA failures that ultimately manifest as hardware
> errors from the guest perspective.
> 
> This change allocates DMA bounce buffers dynamically instead of
> supporting only a single buffer. Thus, multiple DMA mappings work
> correctly also when RAM can't be mmap()-ed.
> 
> The total bounce buffer allocation size is limited individually for each
> AddressSpace. The default limit is 4096 bytes, matching the previous
> maximum buffer size. A new x-max-bounce-buffer-size parameter is
> provided to configure the limit for PCI devices.
> 
> Signed-off-by: Mattias Nissler 

Acked-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH 3/4] virtio-gpu: use a VMState variant for the scanout field

2024-05-07 Thread Peter Xu
On Tue, May 07, 2024 at 03:19:19PM +0400, marcandre.lur...@redhat.com wrote:
> From: Marc-André Lureau 
> 
> Depending on the version, use v1 or v2 of the scanout VM state.
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  hw/display/virtio-gpu.c | 22 +-
>  1 file changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/display/virtio-gpu.c b/hw/display/virtio-gpu.c
> index ae831b6b3e..4fd72caf3f 100644
> --- a/hw/display/virtio-gpu.c
> +++ b/hw/display/virtio-gpu.c
> @@ -1191,17 +1191,29 @@ static const VMStateDescription 
> vmstate_virtio_gpu_scanout = {
>  },
>  };
>  
> +static bool vmstate_before_v2(void *opaque, int version)
> +{
> +return version <= 1;
> +}
> +
>  static const VMStateDescription vmstate_virtio_gpu_scanouts = {
>  .name = "virtio-gpu-scanouts",
> -.version_id = 1,
> +.version_id = 2,
> +.minimum_version_id = 1,
>  .fields = (const VMStateField[]) {
>  VMSTATE_INT32(parent_obj.enable, struct VirtIOGPU),
>  VMSTATE_UINT32_EQUAL(parent_obj.conf.max_outputs,
>   struct VirtIOGPU, NULL),
> -VMSTATE_STRUCT_VARRAY_UINT32(parent_obj.scanout, struct VirtIOGPU,
> - parent_obj.conf.max_outputs, 1,
> - vmstate_virtio_gpu_scanout,
> - struct virtio_gpu_scanout),
> +VMSTATE_VSTRUCT_TEST_VARRAY_UINT32(parent_obj.scanout, struct 
> VirtIOGPU,
> +   vmstate_before_v2,
> +   parent_obj.conf.max_outputs, 1,
> +   vmstate_virtio_gpu_scanout,
> +   struct virtio_gpu_scanout, 1),
> +VMSTATE_VSTRUCT_TEST_VARRAY_UINT32(parent_obj.scanout, struct 
> VirtIOGPU,
> +   NULL,
> +   parent_obj.conf.max_outputs, 2,
> +   vmstate_virtio_gpu_scanout,
> +   struct virtio_gpu_scanout, 2),

Personally I really wish struct_version_id never existed..  After all these
years we only have one user of it (hw/ipmi/isa_ipmi_kcs.c), and we need to
keep that working.  I'm wondering whether there's a way we can avoid adding
one more user, and an even more complicated one at that..

I think I get the reasoning behind "we define the same thing twice", but
this is really tricky and definitely needs rich documentation; meanwhile
all of this seems to rely on so many small details: one entry sets
.field_exists properly, the other leaves it NULL; there are also the two
versionings used here, for both the parent vmsd and the struct, and for
each entry we need to set different versions on different fields..

Would it work if we only put the new fields under the control of
vmstate_before_v2()?  IOW, making all of the below use
.field_exists=vmstate_before_v2, so they're skipped when the machine type
is old?

 VMSTATE_UINT32_V(fb.format, struct virtio_gpu_scanout, 2),
 VMSTATE_UINT32_V(fb.bytes_pp, struct virtio_gpu_scanout, 2),
 VMSTATE_UINT32_V(fb.width, struct virtio_gpu_scanout, 2),
 VMSTATE_UINT32_V(fb.height, struct virtio_gpu_scanout, 2),
 VMSTATE_UINT32_V(fb.stride, struct virtio_gpu_scanout, 2),
 VMSTATE_UINT32_V(fb.offset, struct virtio_gpu_scanout, 2),

Thanks,

-- 
Peter Xu



