Re: [PATCH] KVM: Add system call KVM_VERIFY_MSI to verify MSI vector

2022-11-14 Thread chenxiang (M)

Hi Marc,


On 2022/11/10 18:28, Marc Zyngier wrote:

On Wed, 09 Nov 2022 06:21:18 +,
"chenxiang (M)"  wrote:

Hi Marc,


On 2022/11/8 20:47, Marc Zyngier wrote:

On Tue, 08 Nov 2022 08:08:57 +,
chenxiang  wrote:

From: Xiang Chen 

Currently the number of MSI vectors comes from register PCI_MSI_FLAGS,
which must be a power of 2, but in some scenarios it is not the same as
the number that the driver requires in the guest. For example, a PCI driver
wants to allocate 6 MSI vectors in the guest, but because of the power-of-2
limitation it allocates 8 MSI vectors. So 8 MSI vectors are requested in
qemu while the driver in the guest only wants to allocate 6 MSI vectors.
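(Illustration of the rounding, not part of the patch: the Multiple Message
Capable field in PCI_MSI_FLAGS encodes the vector count as a power of two,
so a request gets rounded up.)

/* For requested >= 2: round up to the next power of two, e.g. 6 -> 8. */
unsigned int requested = 6;
unsigned int allocated = 1u << (32 - __builtin_clz(requested - 1)); /* 8 */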

When GICv4.1 is enabled, we can see an exception printed as follows for the
above scenario:
vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) 
registration fails:66311

In order to verify whether an MSI vector is valid, add KVM_VERIFY_MSI to do
that. If there is a mapping, return 0; otherwise return a negative value.

This is the kernel part of adding system call KVM_VERIFY_MSI.

Exposing something that is an internal implementation detail to
userspace feels like the absolute wrong way to solve this issue.

Can you please characterise the issue you're having? Is it that vfio
tries to enable an interrupt for which there is no virtual ITS
mapping? Shouldn't we instead try and manage this in the kernel?

Before I reported the issue to the community, you gave a suggestion about
it, but I am not sure whether I misunderstood your meaning.
You can refer to the link for more details about the issue.
https://lkml.kernel.org/lkml/87cze9lcut.wl-...@kernel.org/T/

Right. It would have been helpful to mention this earlier. Anyway, I
would really like this to be done without involving userspace at all.

But first, can you please confirm that the VM works as expected
despite the message?

Yes, it works well except for the message.


If that's the case, we only need to handle the
case where this is a multi-MSI setup, and I think this can be done in
VFIO, without involving userspace.


It seems we can verify every kvm_msi for a multi-MSI setup in the function
vfio_pci_set_msi_trigger().
If there is an invalid MSI vector, then we can decrease the number of MSI
vectors before calling vfio_msi_set_block().
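A rough sketch of that idea (purely illustrative; vfio_msi_vector_unmapped()
is a hypothetical helper that would ask KVM/the vITS whether a routing entry
exists, not an existing kernel API):

/* Illustrative only: drop trailing vectors that have no vITS mapping
 * before enabling the block, e.g. trim the 8 allocated vectors down to
 * the 6 the guest driver actually uses. */
while (count > 1 && vfio_msi_vector_unmapped(vdev, start + count - 1)) {
    count--;
}
ret = vfio_msi_set_block(vdev, start, count, fds, msix);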




Thanks,

M.






Re: [PATCH v2] util/qemu-config: Fix "query-command-line-options" to provide the right values

2022-11-14 Thread Markus Armbruster
Thomas Huth  writes:

> On 11/11/2022 15.53, Markus Armbruster wrote:
>> Thomas Huth  writes:
>> 
>>> The "query-command-line-options" command uses a hand-crafted list
>>> of options that should be returned for the "machine" parameter.
>>> This is pretty much out of sync with reality, for example settings
>>> like "kvm_shadow_mem" or "accel" are not parameters for the machine
>>> anymore. Also, there is no distinction between the targets here, so
>>> e.g. the s390x-specific values like "loadparm" in this list also
>>> show up with the other targets like x86_64.
>>>
>>> Let's fix this now by getting rid of the hand-crafted list and by
>>> querying the properties of the machine classes instead to assemble
>>> the list.
>> Do we know what uses this command, and how these users are
>> inconvenienced by the flaw you're fixing?
>> I'm asking because the command is pretty much out of sync with reality
>> by (mis-)design.
>
> libvirt apparently queries this data (see the various
> tests/qemucapabilitiesdata/*.replies files in their repository), but since
> it's so much out of sync with reality, it's not of much use there yet.
>
> See for example here:
>
> https://lists.gnu.org/archive/html/qemu-devel/2021-12/msg00581.html
>
> If we finally fix this problem with "query-command-line-options" in QEMU, it 
> should be much easier to deprecate -no-hpet in QEMU, too.

For a value of "fix".  While we can fix certain concrete issues with
q-c-l-o, which may be worthwhile, the overarching issue is (in my
opinion) unfixable: it can only tell us about QemuOpts.

QemuOpts is only part of the truth.  Last time I checked, it worked for
one out of five CLI options.

Moreover, our needs have long outgrown QemuOpts' design limitations,
which has led to a bunch of bolted-on hacks, none of them represented in
q-c-l-o.




Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-14 Thread Christian Borntraeger




On 14.11.22 at 18:20, Michael S. Tsirkin wrote:
[...]




This does not seem to fix the regression that I have reported.


This was applied on top of 9f6bcfd99f which IIUC does, right?


Just double-checked:

9f6bcfd99f was the patch that created the original problem, no?



Re: [qemu-devel]

2022-11-14 Thread Thomas Huth

On 14/11/2022 23.58, Pawel Polawski wrote:

Hi Everyone,

I am trying to check qemu virtual CPU boundaries when running a custom
edk2-based firmware build. For that purpose I want to run qemu with more
than 1024 vCPUs:

$QEMU
-accel kvm
-m 4G
-M q35,kernel-irqchip=on,smm=on
-smp cpus=1025,maxcpus=1025 -global mch.extended-tseg-mbytes=128
-drive if=pflash,format=raw,file=${CODE},readonly=on
-drive if=pflash,format=raw,file=${VARS}
-chardev stdio,id=fwlog
-device isa-debugcon,iobase=0x402,chardev=fwlog "$@"

The result is as follows:
QEMU emulator version 7.0.50 (v7.0.0-1651-g9cc1bf1ebc-dirty)
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
qemu-system-x86_64: -accel kvm: warning: Number of SMP cpus requested (1025) 
exceeds the recommended cpus supported by KVM (8)
Number of SMP cpus requested (1025) exceeds the maximum cpus supported by 
KVM (1024)


It is not clear to me if I am hitting a qemu limitation or a KVM limitation here.
I have changed the hardcoded 1024 limits in hw/i386/* files but the limitation
is still present.


Can someone advise what I should debug next looking for those vCPU limits?


Well, the error message says it: There is a limitation in KVM, i.e. in the 
kernel code, too. I think it is KVM_MAX_VCPUS in the file 
arch/x86/include/asm/kvm_host.h of the Linux kernel sources... so if you're 
brave, you might want to increase that value there and rebuild your own 
kernel. Not sure whether that works, though.
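For reference, the define in question looks like this (value taken from
recent kernels; it may differ in yours):

/* arch/x86/include/asm/kvm_host.h */
#define KVM_MAX_VCPUS 1024

Raising it means rebuilding the kernel, and larger values may well trip
over other internal limits, hence the "not sure whether that works".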


 Thomas




Re: [PATCH for-7.2 v2 3/4] libvhost-user: Fix two more format strings

2022-11-14 Thread Stefan Weil via

On 05.11.22 at 11:24, Stefan Weil wrote:


This fix is required for 32-bit hosts. The bug was detected by CI
for arm-linux, but is also relevant for i386-linux.

Reported-by: Stefan Hajnoczi 
Signed-off-by: Stefan Weil 
---
  subprojects/libvhost-user/libvhost-user.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index d67953a1c3..80f9952e71 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -651,7 +651,8 @@ generate_faults(VuDev *dev) {
  
  if (ioctl(dev->postcopy_ufd, UFFDIO_REGISTER, &reg_struct)) {

  vu_panic(dev, "%s: Failed to userfault region %d "
-  "@%" PRIx64 " + size:%zx offset: %zx: (ufd=%d)%s\n",
+  "@%" PRIx64 " + size:%" PRIx64 " offset: %" PRIx64
+  ": (ufd=%d)%s\n",
   __func__, i,
   dev_region->mmap_addr,
   dev_region->size, dev_region->mmap_offset,
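For context, a minimal standalone illustration of the mismatch being fixed
(hypothetical values, not part of the patch):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t size = 0x123456789aULL; /* stands in for dev_region->size */

    /* On a 32-bit host, "%zx" expects a 32-bit size_t, so passing a
     * uint64_t reads the wrong number of bytes from the varargs: */
    /* printf("size:%zx\n", size);           triggers -Wformat there */

    printf("size:%" PRIx64 "\n", size);   /* correct on any host */
    return 0;
}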



There is still no review for this patch.

I added "for-7.2" to the subject here in my answer. How can we get all 4 
patches of this series into the new release?


Stefan





[PATCH for-7.2] Add G_GNUC_PRINTF to function qemu_set_info_str and fix related issues

2022-11-14 Thread Stefan Weil via
With the G_GNUC_PRINTF function attribute the compiler detects
two potential insecure format strings:

../../../net/stream.c:248:31: warning: format string is not a string literal 
(potentially insecure) [-Wformat-security]
qemu_set_info_str(&s->nc, uri);
  ^~~
../../../net/stream.c:322:31: warning: format string is not a string literal 
(potentially insecure) [-Wformat-security]
qemu_set_info_str(&s->nc, uri);
  ^~~

There are also two other warnings:

../../../net/socket.c:182:35: warning: zero-length gnu_printf format string 
[-Wformat-zero-length]
  182 | qemu_set_info_str(&s->nc, "");
  |   ^~
../../../net/stream.c:170:35: warning: zero-length gnu_printf format string 
[-Wformat-zero-length]
  170 | qemu_set_info_str(&s->nc, "");

Signed-off-by: Stefan Weil 
---
 include/net/net.h | 3 ++-
 net/socket.c  | 2 +-
 net/stream.c  | 6 +++---
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/net/net.h b/include/net/net.h
index 3db75ff841..dc20b31e9f 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -177,7 +177,8 @@ ssize_t qemu_send_packet_async(NetClientState *nc, const 
uint8_t *buf,
 void qemu_purge_queued_packets(NetClientState *nc);
 void qemu_flush_queued_packets(NetClientState *nc);
 void qemu_flush_or_purge_queued_packets(NetClientState *nc, bool purge);
-void qemu_set_info_str(NetClientState *nc, const char *fmt, ...);
+void qemu_set_info_str(NetClientState *nc,
+   const char *fmt, ...) G_GNUC_PRINTF(2, 3);
 void qemu_format_nic_info_str(NetClientState *nc, uint8_t macaddr[6]);
 bool qemu_has_ufo(NetClientState *nc);
 bool qemu_has_vnet_hdr(NetClientState *nc);
diff --git a/net/socket.c b/net/socket.c
index 4944bb70d5..e62137c839 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -179,7 +179,7 @@ static void net_socket_send(void *opaque)
 s->fd = -1;
 net_socket_rs_init(&s->rs, net_socket_rs_finalize, false);
 s->nc.link_down = true;
-qemu_set_info_str(>nc, "");
+qemu_set_info_str(>nc, "%s", "");
 
 return;
 }
diff --git a/net/stream.c b/net/stream.c
index 53b7040cc4..37ff727e0c 100644
--- a/net/stream.c
+++ b/net/stream.c
@@ -167,7 +167,7 @@ static gboolean net_stream_send(QIOChannel *ioc,
 
 net_socket_rs_init(&s->rs, net_stream_rs_finalize, false);
 s->nc.link_down = true;
-qemu_set_info_str(>nc, "");
+qemu_set_info_str(>nc, "%s", "");
 
 qapi_event_send_netdev_stream_disconnected(s->nc.name);
 
@@ -245,7 +245,7 @@ static void net_stream_listen(QIONetListener *listener,
 }
 g_assert(addr != NULL);
 uri = socket_uri(addr);
-qemu_set_info_str(&s->nc, uri);
+qemu_set_info_str(&s->nc, "%s", uri);
 g_free(uri);
 qapi_event_send_netdev_stream_connected(s->nc.name, addr);
 qapi_free_SocketAddress(addr);
@@ -319,7 +319,7 @@ static void net_stream_client_connected(QIOTask *task, 
gpointer opaque)
 addr = qio_channel_socket_get_remote_address(sioc, NULL);
 g_assert(addr != NULL);
 uri = socket_uri(addr);
-qemu_set_info_str(&s->nc, uri);
+qemu_set_info_str(&s->nc, "%s", uri);
 g_free(uri);
 
 ret = qemu_socket_try_set_nonblock(sioc->fd);
-- 
2.30.2




Re: [PATCH] migration: check magic value for deciding the mapping of channels

2022-11-14 Thread manish.mishra

Thanks Peter

On 14/11/22 10:21 pm, Peter Xu wrote:

Manish,

On Thu, Nov 03, 2022 at 11:47:51PM +0530, manish.mishra wrote:

Yes, but if we try to read early on the main channel with TLS enabled, it is an
issue. Sorry, I may not have put the above comment clearly. I will lay the
scenario out step by step.
1. The main channel is created and the TLS handshake is done for it.
2. The destination side tries to read the magic early on the main channel in
migration_ioc_process_incoming, but it has not yet been sent by the source.
3. The source has written the magic to the main channel's file buffer, but it is
not yet flushed; it is flushed for the first time in ram_save_setup. I mean,
data is sent on the channel only if the qemu file buffer is full or explicitly
flushed.
4. The source side blocks on multifd_send_sync_main in ram_save_setup before
flushing the qemu file. But multifd_send_sync_main is blocked on sem_sync until
the handshake is done for the multiFD channels.
5. The destination side is still waiting to read the magic on the main channel,
so unless we return from migration_ioc_process_incoming we cannot accept a new
channel, so the handshake of the multiFD channels is blocked.
6. So basically the source is blocked on the multiFD channels' handshake before
sending data on the main channel, but the destination is blocked waiting for
data before it can acknowledge the multiFD channels and do their handshake,
which creates a deadlock.

Why is this issue only happening with TLS?  It sounds like it'll happen as
long as multifd enabled.



Actually this was happening with TLS because with TLS we do a handshake, so a
connection is assumed established only after the TLS handshake, and we flush
data from the source only after all channels are established. But with normal
live migration, even if the connection is not accepted on the destination side
we can continue, as we do not do any handshake. Basically, in normal live
migration a connection is assumed established if the connect() call was
successful, even if it is not accepted/acked by the destination; that is why
this deadlock was not happening.



I'm also thinking whether we should flush in qemu_savevm_state_header() so
at least upgraded src qemu will always flush the headers if it never hurts.
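Concretely, something along these lines (an illustrative sketch of the
suggestion, not committed code; the qemu_fflush() call is the assumed
addition):

void qemu_savevm_state_header(QEMUFile *f)
{
    qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
    qemu_put_be32(f, QEMU_VM_FILE_VERSION);

    /* ... configuration section written here, as today ... */

    qemu_fflush(f); /* push the magic out before multifd setup can block */
}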


yes sure Peter.



Thanks

Manish Mishra





Re: [PATCH] hw/intc: sifive_plic: Renumber the S irqs for numa support

2022-11-14 Thread Alistair Francis
On Tue, Nov 15, 2022 at 9:56 AM Philippe Mathieu-Daudé
 wrote:
>
> On 14/11/22 11:34, Frédéric Pétrot wrote:
> > On 14/11/2022 at 09:40, Philippe Mathieu-Daudé wrote:
> >> On 11/11/22 13:19, Frédéric Pétrot wrote:
>
>
> >> Eventually we could unify the style:
> >>
> >> -- >8 --
> >> @@ -476,11 +476,11 @@ DeviceState *sifive_plic_create(hwaddr addr,
> >> char *hart_config,
> >>   CPUState *cpu = qemu_get_cpu(cpu_num);
> >>
> >>   if (plic->addr_config[i].mode == PLICMode_M) {
> >> -qdev_connect_gpio_out(dev, num_harts - plic->hartid_base + cpu_num,
> >> +qdev_connect_gpio_out(dev, cpu_num - hartid_base + num_harts,
> >> qdev_get_gpio_in(DEVICE(cpu),
> >> IRQ_M_EXT));
> >>   }
> >>   if (plic->addr_config[i].mode == PLICMode_S) {
> >> -qdev_connect_gpio_out(dev, cpu_num,
> >> +qdev_connect_gpio_out(dev, cpu_num - hartid_base,
> >> qdev_get_gpio_in(DEVICE(cpu),
> >> IRQ_S_EXT));
> >>   }
> >>   }
> >> ---
> >
> >IIUC hartid_base is used to set plic->hartid_base, so agreed, along
> > with the
> >style unification.
> >I'll send a v2, then.
> >Since Alistair already queued the patch, how shall I proceed?
>
> I didn't notice Alistair queued it (he usually sends a notification by
> responding "queued" to the patches). If it is queued, then it's too late
> (and not a big deal) -- you can still post the v2 and let him pick
> it :)

Yep! I'll drop it from my queue and take the v2

Alistair

>



Re: biosbits test failing on origin/master

2022-11-14 Thread Ani Sinha
On Tue, Nov 15, 2022 at 5:13 AM John Snow  wrote:
>
> On Thu, Nov 10, 2022 at 11:22 PM Ani Sinha  wrote:
> >
> > On Thu, Nov 10, 2022 at 11:37 PM John Snow  wrote:
> > >
> > > Hiya, on today's origin/master
> > > (2ccad61746ca7de5dd3e25146062264387e43bd4) I'm finding that "make
> > > check-avocado" is failing on the new biosbits test on my local
> > > development machine:
> > >
> > >  (001/193) tests/avocado/acpi-bits.py:AcpiBitsTest.test_acpi_smbios_bits:
> > > FAIL: True is not false : The VM seems to have failed to shutdown in
> > > time (83.65 s)
> > >
> > > Is this a known issue, or should I begin to investigate it?
> >
> > In my test environment it does pass.
> >
> > $ ./tests/venv/bin/avocado run -t acpi tests/avocado
> > Fetching asset from
> > tests/avocado/acpi-bits.py:AcpiBitsTest.test_acpi_smbios_bits
> > JOB ID : 35726df7d3c2e0f41847822620c78195ba45b9b9
> > JOB LOG: 
> > /home/anisinha/avocado/job-results/job-2022-11-11T09.42-35726df/job.log
> >  (1/1) tests/avocado/acpi-bits.py:AcpiBitsTest.test_acpi_smbios_bits:
> > PASS (57.57 s)
> > RESULTS: PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0
> > | CANCEL 0
> > JOB TIME   : 63.82 s
> >
> > However, I have seen that on certain slower test machines or when run
> > within a virtual machine, the test can take longer to complete and 60
> > secs may not always be enough. In those cases raising the maximum
> > completion time to 90 secs helps. Perhaps you can try this and let me
> > know if it helps:
>
> Hmm - I'm running on a fairly modern machine and not in a VM. Do you
> have an invocation to share that exists outside of the avocado
> machinery

If you pass V=1 in the environment then it dumps the QEMU command line
that was used to run the test. You also need to comment out the line
> shutil.rmtree(self._workDir)
in tearDown() so that the iso is not cleaned up.

diff --git a/tests/avocado/acpi-bits.py b/tests/avocado/acpi-bits.py
index a67d30d583..2060e3b84f 100644
--- a/tests/avocado/acpi-bits.py
+++ b/tests/avocado/acpi-bits.py
@@ -357,7 +357,7 @@ def tearDown(self):
 if self._vm:
 self.assertFalse(not self._vm.is_running)
 self.logger.info('removing the work directory %s', self._workDir)
-shutil.rmtree(self._workDir)
+# shutil.rmtree(self._workDir)
 super().tearDown()

 def test_acpi_smbios_bits(self):

while you are at it, it might make sense to check the VNC for the VM
to see what it is doing.

 where I could test this individually and see how long it
> might take to complete if I just let it run? I am worried that it's
> getting wedged instead of just taking a long time, but it's hard to
> tell.
>
> --js
>
> >
> > diff --git a/tests/avocado/acpi-bits.py b/tests/avocado/acpi-bits.py
> > index 8745a58a76..b11fe39350 100644
> > --- a/tests/avocado/acpi-bits.py
> > +++ b/tests/avocado/acpi-bits.py
> > @@ -385,8 +385,9 @@ def test_acpi_smbios_bits(self):
> >  self._vm.launch()
> >  # biosbits has been configured to run all the specified test suites
> >  # in batch mode and then automatically initiate a vm shutdown.
> > -# sleep for maximum of one minute
> > -max_sleep_time = time.monotonic() + 60
> > +# sleep for a maximum of one and half minutes to accommodate running this
> > +# even on slower machines.
> > +max_sleep_time = time.monotonic() + 90
> >  while self._vm.is_running() and time.monotonic() < max_sleep_time:
> >  time.sleep(1)
> >
>



Re: [PATCH v6 09/10] vdpa: Add listener_shadow_vq to vhost_vdpa

2022-11-14 Thread Jason Wang
On Tue, Nov 15, 2022 at 12:31 AM Eugenio Perez Martin
 wrote:
>
> On Mon, Nov 14, 2022 at 5:30 AM Jason Wang  wrote:
> >
> >
> > On 2022/11/11 21:12, Eugenio Perez Martin wrote:
> > > On Fri, Nov 11, 2022 at 8:49 AM Jason Wang  wrote:
> > >>
> > >> On 2022/11/10 21:47, Eugenio Perez Martin wrote:
> > >>> On Thu, Nov 10, 2022 at 7:01 AM Jason Wang  wrote:
> >  On Wed, Nov 9, 2022 at 1:08 AM Eugenio Pérez  
> >  wrote:
> > > The memory listener that tells the device how to convert GPA to qemu's
> > > va is registered against CVQ vhost_vdpa. This series tries to map the
> > > memory listener translations to ASID 0, while it maps the CVQ ones to
> > > ASID 1.
> > >
> > > Let's tell the listener if it needs to register them on iova tree or
> > > not.
> > >
> > > Signed-off-by: Eugenio Pérez 
> > > ---
> > > v5: Solve conflict about vhost_iova_tree_remove accepting mem_region 
> > > by
> > >   value.
> > > ---
> > >include/hw/virtio/vhost-vdpa.h | 2 ++
> > >hw/virtio/vhost-vdpa.c | 6 +++---
> > >net/vhost-vdpa.c   | 1 +
> > >3 files changed, 6 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/hw/virtio/vhost-vdpa.h 
> > > b/include/hw/virtio/vhost-vdpa.h
> > > index 6560bb9d78..0c3ed2d69b 100644
> > > --- a/include/hw/virtio/vhost-vdpa.h
> > > +++ b/include/hw/virtio/vhost-vdpa.h
> > > @@ -34,6 +34,8 @@ typedef struct vhost_vdpa {
> > >struct vhost_vdpa_iova_range iova_range;
> > >uint64_t acked_features;
> > >bool shadow_vqs_enabled;
> > > +/* The listener must send iova tree addresses, not GPA */
> > >>
> > >> Btw, Cindy's vIOMMU series will make it not necessarily GPA any more.
> > >>
> > > Yes, this comment should be tuned then. But the SVQ iova_tree will not
> > > be equal to the vIOMMU one because of the shadow vrings.
> > >
> > > But maybe SVQ can inspect both instead of having all the duplicated 
> > > entries.
> > >
> > > +bool listener_shadow_vq;
> > >/* IOVA mapping used by the Shadow Virtqueue */
> > >VhostIOVATree *iova_tree;
> > >GPtrArray *shadow_vqs;
> > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > index 8fd32ba32b..e3914fa40e 100644
> > > --- a/hw/virtio/vhost-vdpa.c
> > > +++ b/hw/virtio/vhost-vdpa.c
> > > @@ -220,7 +220,7 @@ static void 
> > > vhost_vdpa_listener_region_add(MemoryListener *listener,
> > > vaddr, section->readonly);
> > >
> > >llsize = int128_sub(llend, int128_make64(iova));
> > > -if (v->shadow_vqs_enabled) {
> > > +if (v->listener_shadow_vq) {
> > >int r;
> > >
> > >mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
> > > @@ -247,7 +247,7 @@ static void 
> > > vhost_vdpa_listener_region_add(MemoryListener *listener,
> > >return;
> > >
> > >fail_map:
> > > -if (v->shadow_vqs_enabled) {
> > > +if (v->listener_shadow_vq) {
> > >vhost_iova_tree_remove(v->iova_tree, mem_region);
> > >}
> > >
> > > @@ -292,7 +292,7 @@ static void 
> > > vhost_vdpa_listener_region_del(MemoryListener *listener,
> > >
> > >llsize = int128_sub(llend, int128_make64(iova));
> > >
> > > -if (v->shadow_vqs_enabled) {
> > > +if (v->listener_shadow_vq) {
> > >const DMAMap *result;
> > >const void *vaddr = memory_region_get_ram_ptr(section->mr) 
> > > +
> > >section->offset_within_region +
> > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> > > index 85a318faca..02780ee37b 100644
> > > --- a/net/vhost-vdpa.c
> > > +++ b/net/vhost-vdpa.c
> > > @@ -570,6 +570,7 @@ static NetClientState 
> > > *net_vhost_vdpa_init(NetClientState *peer,
> > >s->vhost_vdpa.index = queue_pair_index;
> > >s->always_svq = svq;
> > >s->vhost_vdpa.shadow_vqs_enabled = svq;
> > > +s->vhost_vdpa.listener_shadow_vq = svq;
> >  Any chance those above two can differ?
> > 
> > >>> If CVQ is shadowed but data VQs are not, shadow_vqs_enabled is true
> > >>> but listener_shadow_vq is not.
> > >>>
> > >>> It is more clear in the next commit, where only shadow_vqs_enabled is
> > >>> set to true at vhost_vdpa_net_cvq_start.
> > >>
> > >> Ok, the name looks a little bit confusing. I wonder if it's better to
> > >> use shadow_cvq and shadow_data ?
> > >>
> > > I'm ok with renaming it, but struct vhost_vdpa is generic across all
> > > kind of devices, and it does not know if it is a datapath or not for
> > > the moment.
> > >
> > > Maybe listener_uses_iova_tree?
> >
> >
> > I think "iova_tree" is something that is internal to svq implementation,
> > it's better to define the name from the view 

Re: [PATCH v1 1/1] migration: Fix yank on postcopy multifd crashing guest after migration

2022-11-14 Thread Leonardo Bras Soares Passos
On Thu, Nov 10, 2022 at 10:48 AM Juan Quintela  wrote:
>
> Leonardo Bras  wrote:
> > When multifd and postcopy-ram capabilities are enabled, if a
> > migrate-start-postcopy is attempted, the migration will finish sending the
> > memory pages and then crash with the following error:
> >
> > qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance: Assertion
> > `QLIST_EMPTY(&entry->yankfns)' failed.
> >
> > This happens because even though all multifd channels could
> > yank_register_function(), none of them could unregister it before
> > unregistering the MIGRATION_YANK_INSTANCE, causing the assert to fail.
> >
> > Fix that by calling multifd_load_cleanup() on postcopy_ram_listen_thread()
> > before MIGRATION_YANK_INSTANCE is unregistered.
>
> Hi
>
> One question,
> What guarantees that migration_load_cleanup() is not called twice?
>
> I can't see anything that provides that here.  Or has postcopy
> never done the cleanup of multifd channels before?

IIUC, postcopy has not been doing multifd cleanup for a while, at least
since 6.0.0-rc2.
That is as far back as I went in testing, and after fixing other (build)
bugs, I could get the yank to abort the target qemu after the migration
finished in the multifd + postcopy scenario.


>
> Later, Juan.
>
>
> > Fixes: b5eea99ec2 ("migration: Add yank feature")
> > Reported-by: Li Xiaohui 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  migration/migration.h |  1 +
> >  migration/migration.c | 18 +-
> >  migration/savevm.c|  2 ++
> >  3 files changed, 16 insertions(+), 5 deletions(-)
> >
> > diff --git a/migration/migration.h b/migration/migration.h
> > index cdad8aceaa..240f64efb0 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -473,6 +473,7 @@ void migration_make_urgent_request(void);
> >  void migration_consume_urgent_request(void);
> >  bool migration_rate_limit(void);
> >  void migration_cancel(const Error *error);
> > +bool migration_load_cleanup(void);
> >
> >  void populate_vfio_info(MigrationInfo *info);
> >  void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 739bb683f3..4f363b2a95 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -486,6 +486,17 @@ void migrate_add_address(SocketAddress *address)
> >QAPI_CLONE(SocketAddress, address));
> >  }
> >
> > +bool migration_load_cleanup(void)
> > +{
> > +Error *local_err = NULL;
> > +
> > +if (multifd_load_cleanup(&local_err)) {
> > +error_report_err(local_err);
> > +return true;
> > +}
> > +return false;
> > +}
> > +
> >  static void qemu_start_incoming_migration(const char *uri, Error **errp)
> >  {
> >  const char *p = NULL;
> > @@ -540,8 +551,7 @@ static void process_incoming_migration_bh(void *opaque)
> >   */
> >  qemu_announce_self(&mis->announce_timer, migrate_announce_params());
> >
> > -if (multifd_load_cleanup(&local_err) != 0) {
> > -error_report_err(local_err);
> > +if (migration_load_cleanup()) {
> >  autostart = false;
> >  }
> >  /* If global state section was not received or we are in running
> > @@ -646,9 +656,7 @@ fail:
> >  migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
> >MIGRATION_STATUS_FAILED);
> >  qemu_fclose(mis->from_src_file);
> > -if (multifd_load_cleanup(&local_err) != 0) {
> > -error_report_err(local_err);
> > -}
> > +migration_load_cleanup();
> >  exit(EXIT_FAILURE);
> >  }
> >
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index a0cdb714f7..250caff7f4 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -1889,6 +1889,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
> >  exit(EXIT_FAILURE);
> >  }
> >
> > +migration_load_cleanup();
> > +
>
> This addition is the one that I don't understand why it was not
> needed/done before.

Please see the above comment, but tl;dr, it was not done before.


Thank you for reviewing,
Leo

>
> >  migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> > MIGRATION_STATUS_COMPLETED);
> >  /*
>
> Later, Juan.
>




Re: Re: [PATCH for 8.0 8/8] hmp: add cryptodev info command

2022-11-14 Thread zhenwei pi




On 11/15/22 02:16, Dr. David Alan Gilbert wrote:

* zhenwei pi (pizhen...@bytedance.com) wrote:

Example of this command:
  # virsh qemu-monitor-command vm --hmp info cryptodev
cryptodev1: service=[akcipher|mac|hash|cipher]
 queue 0: type=builtin
cryptodev0: service=[akcipher]
 queue 0: type=lkcf

Signed-off-by: zhenwei pi 
---
  hmp-commands-info.hx  | 14 ++
  include/monitor/hmp.h |  1 +
  monitor/hmp-cmds.c| 36 
  3 files changed, 51 insertions(+)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index 754b1e8408..47d63d26db 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -993,3 +993,17 @@ SRST
``info virtio-queue-element`` *path* *queue* [*index*]
  Display element of a given virtio queue
  ERST
+
+{
+.name   = "cryptodev",
+.args_type  = "",
+.params = "",
+.help   = "show the crypto devices",
+.cmd= hmp_info_cryptodev,
+.flags  = "p",
+},
+
+SRST
+  ``info cryptodev``
+Show the crypto devices.
+ERST
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index dfbc0c9a2f..b6b2b49202 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -143,5 +143,6 @@ void hmp_info_vcpu_dirty_limit(Monitor *mon, const QDict 
*qdict);
  void hmp_human_readable_text_helper(Monitor *mon,
  HumanReadableText *(*qmp_handler)(Error 
**));
  void hmp_info_stats(Monitor *mon, const QDict *qdict);
+void hmp_info_cryptodev(Monitor *mon, const QDict *qdict);
  
  #endif

diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 01b789a79e..3f1054aa1e 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -33,6 +33,7 @@
  #include "qapi/qapi-commands-block.h"
  #include "qapi/qapi-commands-char.h"
  #include "qapi/qapi-commands-control.h"
+#include "qapi/qapi-commands-cryptodev.h"
  #include "qapi/qapi-commands-machine.h"
  #include "qapi/qapi-commands-migration.h"
  #include "qapi/qapi-commands-misc.h"
@@ -2761,3 +2762,38 @@ void hmp_virtio_queue_element(Monitor *mon, const QDict 
*qdict)
  
  qapi_free_VirtioQueueElement(e);

  }
+
+void hmp_info_cryptodev(Monitor *mon, const QDict *qdict)
+{
+CryptodevInfoList *info_list;
+CryptodevInfo *info;
+QCryptodevBackendServiceTypeList *service_list;
+CryptodevBackendClientList *client_list;
+CryptodevBackendClient *client;
+char services[128] = {};


I'd rather avoid magic length buffers; the magic is always the wrong
number!


+int len;
+
+info_list = qmp_query_cryptodev(NULL);
+for ( ; info_list; info_list = info_list->next) {


maybe:
  for ( info_list = qmp_query_cryptodev(NULL);
info_list;
info_list = info_list->next) {


+info = info_list->value;
+
+service_list = info->service;
+for (len = 0; service_list; service_list = service_list->next) {
+len += snprintf(services + len, sizeof(services) - len, "%s|",
+QCryptodevBackendServiceType_str(service_list->value));


Consider using a dynamically allocated string and then just using
g_strconcat or g_strjoin() to glue them all together.

 new_services = g_strjoin("|", services,  NULL);   ?
 g_free(services);
 services = new_services;

Maybe?
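Something like this, perhaps (an untested sketch using GString, reusing the
service_list variable from the patch; g_strjoin would need the pieces
collected first):

GString *services = g_string_new(NULL);

for (service_list = info->service; service_list;
     service_list = service_list->next) {
    if (services->len) {
        g_string_append_c(services, '|');
    }
    g_string_append(services,
                    QCryptodevBackendServiceType_str(service_list->value));
}
monitor_printf(mon, "%s: service=[%s]\n", info->id, services->str);
g_string_free(services, TRUE);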


Hi, I'll fix these in the next version, thanks!




+}
+if (len) {
+services[len - 1] = '\0'; /* strip last char '|' */
+}
+monitor_printf(mon, "%s: service=[%s]\n", info->id, services);
+
+client_list = info->client;
+for ( ; client_list; client_list = client_list->next) {
+client = client_list->value;
+monitor_printf(mon, "queue %ld: type=%s\n", client->queue,
+  QCryptodevBackendType_str(client->type));
+}
+}
+
+qapi_free_CryptodevInfoList(info_list);
+}
--
2.20.1



--
zhenwei pi



Re: aarch64 GitLab CI runner down

2022-11-14 Thread Stefan Hajnoczi
On Mon, 14 Nov 2022 at 19:39, Philippe Mathieu-Daudé  wrote:
> On 14/11/22 22:56, Stefan Hajnoczi wrote:
> > Hi Alex and Richard,
> > The aarch64 GitLab CI runner is down again. Are you able to restart it?
>
> Alex wrote to your RH account :/ This is a scheduled shutdown.
> Equinix is relocating the hardware and this runner will be down
> for (at least) 4 days.

Thanks for letting me know. I hadn't seen the email yet.

> > Any idea why it disconnects sometimes?
>
> Sometimes? Odd. Can we check on the GitLab concentrator logs when
> the runner is unreachable?

The runner was offline recently and Alex fixed it.

Stefan



Re: [PATCH 2/4] hw/cxl: Add CXL_CAPACITY_MULTIPLIER definition

2022-11-14 Thread Jonathan Cameron via
On Tue, 25 Oct 2022 20:47:35 -0400
Gregory Price  wrote:

> Remove usage of magic numbers when accessing capacity fields and replace
> with CXL_CAPACITY_MULTIPLIER, matching the kernel definition.
> 
> Signed-off-by: Gregory Price 
> Reviewed-by: Davidlohr Bueso 
Makes sense. I'll add it to my local tree.

Depending on how timing works out, I might roll this into a wider series
of cleanups sent to Michael or we might want to keep this as part of your
series.

Jonathan
 
> ---
>  hw/cxl/cxl-mailbox-utils.c | 14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> index 3e23d29e2d..d7543fd5b4 100644
> --- a/hw/cxl/cxl-mailbox-utils.c
> +++ b/hw/cxl/cxl-mailbox-utils.c
> @@ -15,6 +15,8 @@
>  #include "qemu/log.h"
>  #include "qemu/uuid.h"
>  
> +#define CXL_CAPACITY_MULTIPLIER   0x10000000 /* SZ_256M */
> +
>  /*
>   * How to add a new command, example. The command set FOO, with cmd BAR.
>   *  1. Add the command set and cmd to the enum.
> @@ -267,7 +269,7 @@ static ret_code cmd_firmware_update_get_info(struct 
> cxl_cmd *cmd,
>  } QEMU_PACKED *fw_info;
>  QEMU_BUILD_BUG_ON(sizeof(*fw_info) != 0x50);
>  
> -if (cxl_dstate->pmem_size < (256 << 20)) {
> +if (cxl_dstate->pmem_size < CXL_CAPACITY_MULTIPLIER) {
>  return CXL_MBOX_INTERNAL_ERROR;
>  }
>  
> @@ -412,7 +414,7 @@ static ret_code cmd_identify_memory_device(struct cxl_cmd 
> *cmd,
>  CXLType3Class *cvc = CXL_TYPE3_GET_CLASS(ct3d);
>  uint64_t size = cxl_dstate->pmem_size;
>  
> -if (!QEMU_IS_ALIGNED(size, 256 << 20)) {
> +if (!QEMU_IS_ALIGNED(size, CXL_CAPACITY_MULTIPLIER)) {
>  return CXL_MBOX_INTERNAL_ERROR;
>  }
>  
> @@ -422,8 +424,8 @@ static ret_code cmd_identify_memory_device(struct cxl_cmd 
> *cmd,
>  /* PMEM only */
>  snprintf(id->fw_revision, 0x10, "BWFW VERSION %02d", 0);
>  
> -id->total_capacity = size / (256 << 20);
> -id->persistent_capacity = size / (256 << 20);
> +id->total_capacity = size / CXL_CAPACITY_MULTIPLIER;
> +id->persistent_capacity = size / CXL_CAPACITY_MULTIPLIER;
>  id->lsa_size = cvc->get_lsa_size(ct3d);
>  id->poison_list_max_mer[1] = 0x1; /* 256 poison records */
>  
> @@ -444,14 +446,14 @@ static ret_code cmd_ccls_get_partition_info(struct 
> cxl_cmd *cmd,
>  QEMU_BUILD_BUG_ON(sizeof(*part_info) != 0x20);
>  uint64_t size = cxl_dstate->pmem_size;
>  
> -if (!QEMU_IS_ALIGNED(size, 256 << 20)) {
> +if (!QEMU_IS_ALIGNED(size, CXL_CAPACITY_MULTIPLIER)) {
>  return CXL_MBOX_INTERNAL_ERROR;
>  }
>  
>  /* PMEM only */
>  part_info->active_vmem = 0;
>  part_info->next_vmem = 0;
> -part_info->active_pmem = size / (256 << 20);
> +part_info->active_pmem = size / CXL_CAPACITY_MULTIPLIER;
>  part_info->next_pmem = 0;
>  
>  *len = sizeof(*part_info);




Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-11-14 Thread Michael Roth
On Mon, Nov 14, 2022 at 06:28:43PM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 14, 2022 at 03:02:37PM +0100, Vlastimil Babka wrote:
> > On 11/1/22 16:19, Michael Roth wrote:
> > > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> > >> > 
> > >> >   1) restoring kernel directmap:
> > >> > 
> > >> >  Currently SNP (and I believe TDX) need to either split or remove 
> > >> > kernel
> > >> >  direct mappings for restricted PFNs, since there is no guarantee 
> > >> > that
> > >> >  other PFNs within a 2MB range won't be used for non-restricted
> > >> >  (which will cause an RMP #PF in the case of SNP since the 2MB
> > >> >  mapping overlaps with guest-owned pages)
> > >> 
> > >> Has the splitting and restoring been a well-discussed direction? I'm
> > >> just curious whether there is other options to solve this issue.
> > > 
> > > For SNP it's been discussed for quite some time, and either splitting or
> > > removing private entries from directmap are the well-discussed way I'm
> > > aware of to avoid RMP violations due to some other kernel process using
> > > a 2MB mapping to access shared memory if there are private pages that
> > > happen to be within that range.
> > > 
> > > In both cases the issue of how to restore directmap as 2M becomes a
> > > problem.
> > > 
> > > I was also under the impression TDX had similar requirements. If so,
> > > do you know what the plan is for handling this for TDX?
> > > 
> > > There are also 2 potential alternatives I'm aware of, but these haven't
> > > been discussed in much detail AFAIK:
> > > 
> > > a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
> > >request 2MB THP pages, but I'm not sure how reliably we can guarantee
> > >that enough THPs are available, so if we went that route we'd probably
> > >be better off requiring the use of hugetlbfs as the backing store. But
> > >obviously that's a bit limiting and it would be nice to have the option
> > >of using normal pages as well. One nice thing with the invalidation
> > >scheme proposed here is that this would "Just Work" if we implement
> > >hugetlbfs support, so an admin that doesn't want any directmap
> > >splitting has this option available, otherwise it's done as a
> > >best-effort.
> > > 
> > > b) Implement general support for restoring directmap as 2M even when
> > >subpages might be in use by other kernel threads. This would be the
> > >most flexible approach since it requires no special handling during
> > >invalidations, but I think it's only possible if all the CPA
> > >attributes for the 2M range are the same at the time the mapping is
> > >restored/unsplit, so some potential locking issues there and still
> > >chance for splitting directmap over time.
> > 
> > I've been hoping that
> > 
> > c) using a mechanism such as [1] [2] where the goal is to group together
> > these small allocations that need to increase directmap granularity so
> > maximum number of large mappings are preserved.
> 
> As I mentioned in the other thread the restricted memfd can be backed by
> secretmem instead of plain memfd. It already handles directmap with care.

It looks like it would handle direct unmapping/cleanup nicely, but it
seems to lack fallocate(PUNCH_HOLE) support, which we'd probably want in
order to avoid additional memory requirements. I think once we added that
we'd still end up needing some sort of handling for the invalidations.

Also, I know Chao has been considering hugetlbfs support, I assume by
leveraging the support that already exists in shmem. Ideally SNP would
be able to make use of that support as well, but relying on a separate
backend seems likely to result in more complications getting there
later.

> 
> But I don't think it has to be part of initial restricted memfd
> implementation. It is SEV-specific requirement and AMD folks can extend
> implementation as needed later.

Admittedly the suggested changes to the invalidation mechanism made a
lot more sense to me when I was under the impression that TDX would have
similar requirements and we might end up with a common hook. Since that
doesn't actually seem to be the case, it makes sense to try to do it as
a platform-specific hook for SNP.

I think, given a memslot, a GFN range, and kvm_restricted_mem_get_pfn(),
we should be able to get the same information needed to figure out whether
the range is backed by huge pages or not. I'll see how that works out
instead.

Thanks,

Mike

> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v4 2/7] target/arm: Make RVBAR available for all ARMv8 CPUs

2022-11-14 Thread Peter Maydell
On Sun, 23 Oct 2022 at 16:37,  wrote:
>
> From: Tobias Röhmel 
>
> RVBAR shadows RVBAR_ELx where x is the highest exception
> level if the highest EL is not EL3. This patch also allows
> ARMv8 CPUs to change the reset address with
> the rvbar property.
>
> Signed-off-by: Tobias Röhmel 

> diff --git a/target/arm/helper.c b/target/arm/helper.c
> index 3c517356e1..2e9e420d4e 100644
> --- a/target/arm/helper.c
> +++ b/target/arm/helper.c
> @@ -7768,8 +7768,8 @@ void register_cp_regs_for_features(ARMCPU *cpu)
>  if (!arm_feature(env, ARM_FEATURE_EL3) &&
>  !arm_feature(env, ARM_FEATURE_EL2)) {
>  ARMCPRegInfo rvbar = {
> -.name = "RVBAR_EL1", .state = ARM_CP_STATE_AA64,
> -.opc0 = 3, .opc1 = 0, .crn = 12, .crm = 0, .opc2 = 1,
> +.name = "RVBAR_EL1", .state = ARM_CP_STATE_BOTH,
> +.opc0 = 3, .cp = 15, .opc1 = 0, .crn = 12, .crm = 0, .opc2 = 1,

You don't need to specify .cp for a STATE_BOTH register: 15
is the default.
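i.e. the same entry without the redundant field (sketch of the requested
change, based only on the diff above):

ARMCPRegInfo rvbar = {
    .name = "RVBAR_EL1", .state = ARM_CP_STATE_BOTH,
    .opc0 = 3, .opc1 = 0, .crn = 12, .crm = 0, .opc2 = 1,
    /* remaining fields as in the patch */
};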

Otherwise
Reviewed-by: Peter Maydell 

thanks
-- PMM



Re: [PATCH 2/2] Do not access /dev/mem in MSI-X PCI passthrough on Xen

2022-11-14 Thread Marek Marczykowski-Górecki
On Mon, Nov 14, 2022 at 07:39:48PM +, Andrew Cooper wrote:
> On 14/11/2022 19:20, Marek Marczykowski-Górecki wrote:
> > The /dev/mem is used for two purposes:
> >  - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
> >  - reading Pending Bit Array (PBA)
> >
> > The first one was originally done because Xen did not send all
> > vector ctrl writes to the device model, so QEMU might have had an outdated
> > register value. This has been changed in Xen, so QEMU can now use its
> > cached value of the register instead.

I should have noted that "has been changed in Xen" isn't fully accurate
(yet). It refers to this patch:
https://lore.kernel.org/xen-devel/20221114192100.1539267-2-marma...@invisiblethingslab.com/T/#u

> > The Pending Bit Array (PBA) handling is for the case where it lives on
> > the same page as the MSI-X table itself. Xen has been extended to handle
> > this case too (as well as other registers that may live on those pages),
> > so QEMU handling is not necessary anymore.
> >
> > Removing /dev/mem access is useful to work within stubdomain, and
> > necessary when dom0 kernel runs in lockdown mode.
> >
> > Signed-off-by: Marek Marczykowski-Górecki 
> 
> The commit message ought to go further.  Using /dev/mem like this is
> buggy anyway, because it is trapped and emulated by Xen in whatever
> context Qemu is running.  Qemu cannot get the actual hardware value, and
> even if it could, it would be racy with transient operations needing to
> mask the vector.
> 
> i.e. it's not just nice-to-remove - it's fixing real corner cases.

Good point, I'll extend it.
But for this to work, the Xen patch needs to go in first (which won't
happen for 4.17). And also, upstream QEMU probably needs some way to
detect whether Xen has the change or not - to work with older versions
too.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v7 11/12] multifd: Zero pages transmission

2022-11-14 Thread Juan Quintela
Leonardo Brás  wrote:
> On Tue, 2022-08-02 at 08:39 +0200, Juan Quintela wrote:
>> This implements the zero page detection and handling.
>> 
>> Signed-off-by: Juan Quintela 

Hi

on further investigation, I see why it can't be a problem.


>>  static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
>> @@ -358,6 +365,18 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams 
>> *p, Error **errp)
>>  p->normal[i] = offset;
>>  }
>>  
>> +for (i = 0; i < p->zero_num; i++) {
>> +uint64_t offset = be64_to_cpu(packet->offset[p->normal_num + i]);
>> +
>> +if (offset > (block->used_length - p->page_size)) {

We checked that we are inside the RAM block.  You can't have an offset
bigger than 32 bits when you have 32 bits of RAM.


>> @@ -688,8 +710,16 @@ static void *multifd_send_thread(void *opaque)
>>  }
>>  
>>  for (int i = 0; i < p->pages->num; i++) {
>> -p->normal[p->normal_num] = p->pages->offset[i];
>> -p->normal_num++;
>> +uint64_t offset = p->pages->offset[i];

We are reading the offset here.
p->pages->offset is ram_addr_t, so no problem here.

>> +if (use_zero_page &&
>> +buffer_is_zero(rb->host + offset, p->page_size)) {
>> +p->zero[p->zero_num] = offset;
>
> Same here.

This and the next case are exactly the same; we are doing:

ram_addr_t offset1;
u64 foo = offset1;
ram_addr_t offset2 = foo;

So, it should be right.  Everything is unsigned here.

>> +p->zero_num++;
>> +ram_release_page(rb->idstr, offset);
>> +} else {
>> +p->normal[p->normal_num] = offset;
>
> Same here? (p->normal[i] can also be u32)

Thanks, Juan.




[PATCH v2 1/1] tcg: add perfmap and jitdump

2022-11-14 Thread Ilya Leoshkevich
Add the ability to dump /tmp/perf-<pid>.map and jit-<pid>.dump.
The first one allows the perf tool to map samples to each individual
translation block. The second one adds the ability to resolve symbol
names, line numbers and inspect JITed code.

Example of use:

perf record qemu-x86_64 -perfmap ./a.out
perf report

or

perf record -k 1 qemu-x86_64 -jitdump ./a.out
perf inject -j -i perf.data -o perf.data.jitted
perf report -i perf.data.jitted

Co-developed-by: Vanderson M. do Rosario 
Co-developed-by: Alex Bennée 
Signed-off-by: Ilya Leoshkevich 
Message-Id: <20221012051846.1432050-2-...@linux.ibm.com>
---
 accel/tcg/debuginfo.c |  99 +++
 accel/tcg/debuginfo.h |  52 ++
 accel/tcg/meson.build |   2 +
 accel/tcg/perf.c  | 334 ++
 accel/tcg/perf.h  |  28 
 accel/tcg/translate-all.c |   8 +
 docs/devel/tcg.rst|  23 +++
 hw/core/loader.c  |   5 +
 include/tcg/tcg.h |   4 +-
 linux-user/elfload.c  |   3 +
 linux-user/exit.c |   2 +
 linux-user/main.c |  15 ++
 linux-user/meson.build|   1 +
 meson.build   |   8 +
 qemu-options.hx   |  20 +++
 softmmu/vl.c  |  11 ++
 tcg/region.c  |   2 +-
 tcg/tcg.c |   2 +
 18 files changed, 616 insertions(+), 3 deletions(-)
 create mode 100644 accel/tcg/debuginfo.c
 create mode 100644 accel/tcg/debuginfo.h
 create mode 100644 accel/tcg/perf.c
 create mode 100644 accel/tcg/perf.h

diff --git a/accel/tcg/debuginfo.c b/accel/tcg/debuginfo.c
new file mode 100644
index 000..c312db77146
--- /dev/null
+++ b/accel/tcg/debuginfo.c
@@ -0,0 +1,99 @@
+/*
+ * Debug information support.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/lockable.h"
+
+#include <elfutils/libdwfl.h>
+
+#include "debuginfo.h"
+
+static QemuMutex lock;
+static Dwfl *dwfl;
+static const Dwfl_Callbacks dwfl_callbacks = {
+.find_elf = NULL,
+.find_debuginfo = dwfl_standard_find_debuginfo,
+.section_address = NULL,
+.debuginfo_path = NULL,
+};
+
+__attribute__((constructor))
+static void debuginfo_init(void)
+{
+qemu_mutex_init(&lock);
+}
+
+bool debuginfo_report_elf(const char *image_name, int image_fd,
+  unsigned long long load_bias)
+{
+QEMU_LOCK_GUARD(&lock);
+
+if (dwfl == NULL) {
+dwfl = dwfl_begin(&dwfl_callbacks);
+} else {
+dwfl_report_begin_add(dwfl);
+}
+
+if (dwfl == NULL) {
+return false;
+}
+
+dwfl_report_elf(dwfl, image_name, image_name, image_fd, load_bias, true);
+dwfl_report_end(dwfl, NULL, NULL);
+return true;
+}
+
+bool debuginfo_get_symbol(unsigned long long address,
+  const char **symbol, unsigned long long *offset)
+{
+Dwfl_Module *dwfl_module;
+GElf_Off dwfl_offset;
+GElf_Sym dwfl_sym;
+
+QEMU_LOCK_GUARD(&lock);
+
+if (dwfl == NULL) {
+return false;
+}
+
+dwfl_module = dwfl_addrmodule(dwfl, address);
+if (dwfl_module == NULL) {
+return false;
+}
+
+*symbol = dwfl_module_addrinfo(dwfl_module, address, &dwfl_offset,
+   &dwfl_sym, NULL, NULL, NULL);
+if (*symbol == NULL) {
+return false;
+}
+*offset = dwfl_offset;
+return true;
+}
+
+bool debuginfo_get_line(unsigned long long address,
+const char **file, int *line)
+{
+Dwfl_Module *dwfl_module;
+Dwfl_Line *dwfl_line;
+
+QEMU_LOCK_GUARD(&lock);
+
+if (dwfl == NULL) {
+return false;
+}
+
+dwfl_module = dwfl_addrmodule(dwfl, address);
+if (dwfl_module == NULL) {
+return false;
+}
+
+dwfl_line = dwfl_module_getsrc(dwfl_module, address);
+if (dwfl_line == NULL) {
+return false;
+}
+*file = dwfl_lineinfo(dwfl_line, NULL, line, 0, NULL, NULL);
+return true;
+}
diff --git a/accel/tcg/debuginfo.h b/accel/tcg/debuginfo.h
new file mode 100644
index 000..d41d9d8d9b4
--- /dev/null
+++ b/accel/tcg/debuginfo.h
@@ -0,0 +1,52 @@
+/*
+ * Debug information support.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef ACCEL_TCG_DEBUGINFO_H
+#define ACCEL_TCG_DEBUGINFO_H
+
+#if defined(CONFIG_TCG) && defined(CONFIG_LIBDW)
+/*
+ * Load debuginfo for the specified guest ELF image.
+ * Return true on success, false on failure.
+ */
+bool debuginfo_report_elf(const char *image_name, int image_fd,
+  unsigned long long load_bias);
+
+/*
+ * Find a symbol name associated with the specified guest PC.
+ * Return true on success, false if there is no associated symbol.
+ */
+bool debuginfo_get_symbol(unsigned long long address,
+  const char **symbol, unsigned long long *offset);
+
+/*
+ * Find a line number associated with the specified guest PC.
+ * Return true on success, false if there is no associated line number.
+ */
+bool debuginfo_get_line(unsigned long long address,
+ 

Re: [PULL 00/11] Block layer patches

2022-11-14 Thread Kevin Wolf
On 11.11.2022 at 20:20, Stefan Hajnoczi wrote:
> > Hanna Reitz (9):
> >   block/mirror: Do not wait for active writes
> >   block/mirror: Drop mirror_wait_for_any_operation()
> >   block/mirror: Fix NULL s->job in active writes
> >   iotests/151: Test that active mirror progresses
> >   iotests/151: Test active requests on mirror start
> >   block: Make bdrv_child_get_parent_aio_context I/O
> >   block-backend: Update ctx immediately after root
> >   block: Start/end drain on correct AioContext
> >   tests/stream-under-throttle: New test
> 
> Hi Hanna,
> This test is broken, probably due to the minimum Python version:
> https://gitlab.com/qemu-project/qemu/-/jobs/3311521303

This is exactly the problem I saw with running linters in a gating CI,
but not during 'make check'. And of course, we're hitting it during the
-rc phase now. :-(

But yes, it seems that asyncio.TimeoutError should be used instead of
asyncio.exceptions.TimeoutError, and Python 3.6 has only the former.
I'll fix this up and send a v2 if it fixes check-python-pipenv.

Kevin




Re: aarch64 GitLab CI runner down

2022-11-14 Thread Philippe Mathieu-Daudé

Hi Stefan,

On 14/11/22 22:56, Stefan Hajnoczi wrote:

Hi Alex and Richard,
The aarch64 GitLab CI runner is down again. Are you able to restart it?


Alex wrote to your RH account :/ This is a scheduled shutdown.
Equinix is relocating the hardware and this runner will be down
for (at least) 4 days.


Any idea why it disconnects sometimes?


Sometimes? Odd. Can we check on the GitLab concentrator logs when
the runner is unreachable?

Regards,

Phil.



[PULL 0/2] target/i386: fix cmpxchgl, lahf, sahf

2022-11-14 Thread Richard Henderson
The following changes since commit 98f10f0e2613ba1ac2ad3f57a5174014f6dcb03d:

  Merge tag 'pull-target-arm-20221114' of 
https://git.linaro.org/people/pmaydell/qemu-arm into staging (2022-11-14 
13:31:17 -0500)

are available in the Git repository at:

  https://gitlab.com/rth7680/qemu.git tags/pull-x86-20221115

for you to fetch changes up to 35d95e4126d83c0bb0de83007494d184f6111b3d:

  target/i386: hardcode R_EAX as destination register for LAHF/SAHF (2022-11-15 
09:34:42 +1000)


Fix cmpxchgl writeback to rax.
Fix lahf/sahf for 64-bit


Paolo Bonzini (2):
  target/i386: fix cmpxchg with 32-bit register destination
  target/i386: hardcode R_EAX as destination register for LAHF/SAHF

 target/i386/tcg/translate.c  | 86 +++-
 tests/tcg/x86_64/cmpxchg.c   | 42 
 tests/tcg/x86_64/Makefile.target |  1 +
 3 files changed, 101 insertions(+), 28 deletions(-)
 create mode 100644 tests/tcg/x86_64/cmpxchg.c
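For background, the architectural behaviour the cmpxchg fix is about (my
reading of the commit title; the sketch assumes an x86-64 host): when a
32-bit cmpxchg fails the comparison, it loads the memory value into %eax,
and like any 32-bit register write this must zero-extend into %rax.

#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint32_t mem = 0x11111111;
    uint64_t rax = 0xffffffff22222222ULL; /* comparison will fail */
    uint32_t newval = 0x33333333;

    /* eax != mem, so eax receives mem; the high half of rax must
     * become zero because the destination is a 32-bit register. */
    asm volatile("cmpxchgl %2, %1"
                 : "+a"(rax), "+m"(mem)
                 : "r"(newval)
                 : "cc");
    assert(rax == 0x11111111ULL);
    return 0;
}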



Re: [PATCH v7 10/12] multifd: Support for zero pages transmission

2022-11-14 Thread Juan Quintela
chuang xu  wrote:
> On 2022/8/2 2:39 PM, Juan Quintela wrote:
>> This patch adds counters and similar.  Logic will be added on the
>> following patch.
>>
>> Signed-off-by: Juan Quintela 
>>

>>   sync_needed = p->flags & MULTIFD_FLAG_SYNC;
>>   /* recv methods don't know how to handle the SYNC flag */
>>   p->flags &= ~MULTIFD_FLAG_SYNC;
>>   p->num_packets++;
>>   p->total_normal_pages += p->normal_num;
>> +p->total_normal_pages += p->zero_num;
>
> Hi, Juan:
>
> If I understand correctly, it should be "p->total_zero_pages +=
> p->zero_num; ".

Very good catch. Thanks.

That is what rebases do to you.

> By the way, this patch seems to greatly improve the performance of
> zero page checking, but it seems that there has been no new update in
> the past two months. When will it be merged into master?

I am resending right now.

Later, Juan.




Re: [PATCH] s390x: Fix spelling errors

2022-11-14 Thread Ilya Leoshkevich
On Fri, Nov 11, 2022 at 07:28:28PM +0100, Thomas Huth wrote:
> Fix typos (discovered with the 'codespell' utility).
> 
> Signed-off-by: Thomas Huth 
> ---
>  hw/s390x/ipl.h  | 2 +-
>  pc-bios/s390-ccw/cio.h  | 2 +-
>  pc-bios/s390-ccw/iplb.h | 2 +-
>  target/s390x/cpu_models.h   | 4 ++--
>  hw/s390x/s390-pci-vfio.c| 2 +-
>  hw/s390x/s390-virtio-ccw.c  | 6 +++---
>  target/s390x/ioinst.c   | 2 +-
>  target/s390x/tcg/excp_helper.c  | 2 +-
>  target/s390x/tcg/fpu_helper.c   | 2 +-
>  target/s390x/tcg/misc_helper.c  | 2 +-
>  target/s390x/tcg/translate.c| 4 ++--
>  target/s390x/tcg/translate_vx.c.inc | 6 +++---
>  pc-bios/s390-ccw/start.S| 2 +-
>  13 files changed, 19 insertions(+), 19 deletions(-)

Reviewed-by: Ilya Leoshkevich 



Re: [PATCH v2] capstone: use <capstone/capstone.h> instead of <capstone.h>

2022-11-14 Thread Daniel P . Berrangé
On Mon, Nov 14, 2022 at 12:13:48PM +0300, Michael Tokarev wrote:
> 14.11.2022 11:58, Daniel P. Berrangé wrote:
> ..
> > > On current systems, using <capstone/capstone.h> works
> > > now (despite the pkg-config-supplied -I/usr/include/capstone) -
> > > since on all systems capstone headers are put into capstone/
> > > subdirectory of a system include dir. So this change is
> > > compatible with both the obsolete way of including it
> > > and the only future way.
> > 
> > AFAIR, macOS HomeBrew does not put anything into the system
> > include dir, and always requires -I flags to be correct.
> 
> Does it work with the capstone-supplied --cflags and the proposed
> include path?  What does pkg-config --cflags capstone return there?

I see the QEMU build logs adding:

 -I/usr/local/Cellar/capstone/4.0.2/include/capstone

so #include <capstone/capstone.h> seems unlikely to work

> > > -  if capstone.found() and not cc.compiles('#include <capstone.h>',
> > > +  if capstone.found() and not cc.compiles('#include <capstone/capstone.h>',
> > > dependencies: [capstone])
> > 
> > To retain back compat this could probe for both ways
> > 
> >  if capstone.found()
> >  if cc.compiles('#include <capstone.h>',
> >dependencies: [capstone])
> > ...
> >  else if cc.compiles('#include <capstone/capstone.h>',
> > dependencies: [capstone])
> > ...
> > then, the source file can try the correct #include based on what
> > we detect works here.
> 
> I don't think this deserves the complexity really, unless there *is*
> a system out there which actually needs this.
> 
> I mean, these little compat tweaks, - it becomes twisty with time,
> and no one knows which code paths and config variables are needed
> for what, and whole thing slowly becomes unmanageable... If it's
> easy to make it unconditional, it should be done. IMHO anyway :)

Well, you're proposing a change during RC time which is likely to
break builds if the assumption that it's always in the system
include path is wrong. So I think the explicit compatibility is
required to reduce the risk of this creating a regression.

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v2 06/19] hw/9pfs: Add missing definitions for Windows

2022-11-14 Thread Christian Schoenebeck
On Friday, November 11, 2022 5:22:12 AM CET Bin Meng wrote:
> From: Guohuai Shi 
> 
> Some definitions currently used by the 9pfs codes are only available
> on POSIX platforms. Let's add our own ones in preparation to adding
> 9pfs support for Windows.
> 
> Signed-off-by: Guohuai Shi 
> Signed-off-by: Bin Meng 
> 
> ---
> 
> Changes in v2:
> - Add S_IFLNK related macros to support symbolic link
> 
>  fsdev/file-op-9p.h | 33 +
>  hw/9pfs/9p.h   | 33 +
>  2 files changed, 66 insertions(+)
> 
> diff --git a/fsdev/file-op-9p.h b/fsdev/file-op-9p.h
> index 4997677460..7d9a736b66 100644
> --- a/fsdev/file-op-9p.h
> +++ b/fsdev/file-op-9p.h
> @@ -27,6 +27,39 @@
>  # include 
>  #endif
>  
> +#ifdef CONFIG_WIN32
> +
> +/* POSIX structure not defined in Windows */
> +
> +typedef uint32_t uid_t;
> +typedef uint32_t gid_t;
> +
> +/* from http://man7.org/linux/man-pages/man2/statfs.2.html */
> +typedef uint32_t __fsword_t;
> +typedef uint32_t fsblkcnt_t;
> +typedef uint32_t fsfilcnt_t;
> +
> +/* from linux/include/uapi/asm-generic/posix_types.h */
> +typedef struct {
> +long __val[2];
> +} fsid_t;
> +
> +struct statfs {
> +__fsword_t f_type;
> +__fsword_t f_bsize;
> +fsblkcnt_t f_blocks;
> +fsblkcnt_t f_bfree;
> +fsblkcnt_t f_bavail;
> +fsfilcnt_t f_files;
> +fsfilcnt_t f_ffree;
> +fsid_t f_fsid;
> +__fsword_t f_namelen;
> +__fsword_t f_frsize;
> +__fsword_t f_flags;
> +};
> +

Does it make sense to define all of these, even though they are not being used?

> +#endif /* CONFIG_WIN32 */
> +
>  #define SM_LOCAL_MODE_BITS0600
>  #define SM_LOCAL_DIR_MODE_BITS0700
>  
> diff --git a/hw/9pfs/9p.h b/hw/9pfs/9p.h
> index 2fce4140d1..957a7e4ccc 100644
> --- a/hw/9pfs/9p.h
> +++ b/hw/9pfs/9p.h
> @@ -3,13 +3,46 @@
>  
>  #include 
>  #include 
> +#ifndef CONFIG_WIN32
>  #include 
> +#endif
>  #include "fsdev/file-op-9p.h"
>  #include "fsdev/9p-iov-marshal.h"
>  #include "qemu/thread.h"
>  #include "qemu/coroutine.h"
>  #include "qemu/qht.h"
>  
> +#ifdef CONFIG_WIN32
> +
> +#define NAME_MAX    MAX_PATH

That's not quite the same. MAX_PATH on Windows corresponds to PATH_MAX on
POSIX, which is the max. length of an entire path (i.e. drive, multiple
directory names, filename, backslashes). AFAICS MAX_PATH is 260 on Windows.

The max. length of a single filename component OTOH is 255 on Windows by
default. I don't know if there is a macro for the latter, if not, maybe
just hard coding it here for now?
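Along those lines, perhaps (illustrative only; 255 is the default NTFS
filename-component limit mentioned above):

/* Max length of a single filename component, not of a whole path. */
#define NAME_MAX    255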

> +
> +/* macros required for build, values do not matter */
> +#define AT_SYMLINK_NOFOLLOW 0x100   /* Do not follow symbolic links */
> +#define AT_REMOVEDIR0x200   /* Remove directory instead of file */
> +#define O_DIRECTORY 0200
> +
> +#define makedev(major, minor)   \
> +((dev_t)((((major) & 0xfff) << 8) | ((minor) & 0xff)))
> +#define major(dev)  ((unsigned int)(((dev) >> 8) & 0xfff))
> +#define minor(dev)  ((unsigned int)(((dev) & 0xff)))
> +
> +#ifndef S_IFLNK
> +/*
> + * Currently Windows/MinGW does not provide the following flag macros,
> + * so define them here for the 9p code.
> + *
> + * Once Windows/MinGW provides them, remove the defines to prevent conflicts.
> + */
> +#define S_IFLNK 0xA000
> +#define S_ISUID 0x0800
> +#define S_ISGID 0x0400
> +#define S_ISVTX 0x0200
> +
> +#define S_ISLNK(mode)   ((mode & S_IFMT) == S_IFLNK)
> +#endif /* S_IFLNK */
> +
> +#endif /* CONFIG_WIN32 */
> +
>  enum {
>  P9_TLERROR = 6,
>  P9_RLERROR,
> 





Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-14 Thread Michael S. Tsirkin
On Mon, Nov 14, 2022 at 05:55:09PM +0100, Christian Borntraeger wrote:
> 
> 
> Am 14.11.22 um 17:37 schrieb Michael S. Tsirkin:
> > On Mon, Nov 14, 2022 at 05:18:53PM +0100, Christian Borntraeger wrote:
> > > Am 08.11.22 um 10:23 schrieb Alex Bennée:
> > > > The previous fix to virtio_device_started revealed a problem in its
> > > > use by both the core and the device code. The core code should be able
> > > > to handle the device "starting" while the VM isn't running to handle
> > > > the restoration of migration state. To solve this dual use introduce a
> > > > new helper for use by the vhost-user backends who all use it to feed a
> > > > should_start variable.
> > > > 
> > > > We can also pick up a change to vhost_user_blk_set_status while we are at
> > > > it which follows the same pattern.
> > > > 
> > > > Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to 
> > > > virtio_device_started)
> > > > Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio 
> > > > device)
> > > > Signed-off-by: Alex Bennée 
> > > > Cc: "Michael S. Tsirkin" 
> > > 
> > > Hmmm, is this
> > > commit 259d69c00b67c02a67f3bdbeeea71c2c0af76c35
> > > Author: Alex Bennée 
> > > AuthorDate: Mon Nov 7 12:14:07 2022 +
> > > Commit: Michael S. Tsirkin 
> > > CommitDate: Mon Nov 7 14:08:18 2022 -0500
> > > 
> > >  hw/virtio: introduce virtio_device_should_start
> > > 
> > > and older version?
> > 
> > This is what got merged:
> > https://lore.kernel.org/r/20221107121407.1010913-1-alex.bennee%40linaro.org
> > This patch was sent after I merged the RFC.
> > I think the only difference is the commit log but I might be missing
> > something.
> > 
> > > This does not seem to fix the regression that I have reported.
> > 
> > This was applied on top of 9f6bcfd99f which IIUC does, right?
> > 
> > 
> 
> QEMU master still fails for me for suspend/resume to disk:
> 
> #0  0x03ff8e3980a6 in __pthread_kill_implementation () at /lib64/libc.so.6
> #1  0x03ff8e348580 in raise () at /lib64/libc.so.6
> #2  0x03ff8e32b5c0 in abort () at /lib64/libc.so.6
> #3  0x03ff8e3409da in __assert_fail_base () at /lib64/libc.so.6
> #4  0x03ff8e340a4e in  () at /lib64/libc.so.6
> #5  0x02aa1ffa8966 in vhost_vsock_common_pre_save (opaque=<optimized out>) at ../hw/virtio/vhost-vsock-common.c:203
> #6  0x02aa1fe5e0ee in vmstate_save_state_v
> (f=f@entry=0x2aa21bdc170, vmsd=0x2aa204ac5f0 
> , opaque=0x2aa21bac9f8, 
> vmdesc=vmdesc@entry=0x3fddc08eb30, version_id=version_id@entry=0) at 
> ../migration/vmstate.c:329
> #7  0x02aa1fe5ebf8 in vmstate_save_state (f=f@entry=0x2aa21bdc170, 
> vmsd=, opaque=, 
> vmdesc_id=vmdesc_id@entry=0x3fddc08eb30) at ../migration/vmstate.c:317
> #8  0x02aa1fe75bd0 in vmstate_save (f=f@entry=0x2aa21bdc170, 
> se=se@entry=0x2aa21bdbe90, vmdesc=vmdesc@entry=0x3fddc08eb30) at 
> ../migration/savevm.c:908
> #9  0x02aa1fe79584 in qemu_savevm_state_complete_precopy_non_iterable 
> (f=f@entry=0x2aa21bdc170, in_postcopy=in_postcopy@entry=false, 
> inactivate_disks=inactivate_disks@entry=true)
> at ../migration/savevm.c:1393
> #10 0x02aa1fe79a96 in qemu_savevm_state_complete_precopy 
> (f=0x2aa21bdc170, iterable_only=iterable_only@entry=false, 
> inactivate_disks=inactivate_disks@entry=true) at ../migration/savevm.c:1459
> #11 0x02aa1fe6d6ee in migration_completion (s=0x2aa218ef600) at 
> ../migration/migration.c:3314
> #12 migration_iteration_run (s=0x2aa218ef600) at ../migration/migration.c:3761
> #13 migration_thread (opaque=opaque@entry=0x2aa218ef600) at 
> ../migration/migration.c:3989
> #14 0x02aa201f0b8c in qemu_thread_start (args=) at 
> ../util/qemu-thread-posix.c:505
> #15 0x03ff8e396248 in start_thread () at /lib64/libc.so.6
> #16 0x03ff8e41183e in thread_start () at /lib64/libc.so.6
> 
> Michael, your previous branch did work if I recall correctly.

That one was failing under github CI though (for reasons we didn't
really address, such as disconnect during stop causing a recursive
call to stop, but there you are).

-- 
MST




[PATCH v4] qapi/qom: Memory backend property prealloc-threads doc fix

2022-11-14 Thread Zhenyu Zhang
Commit ffac16fab3 "hostmem: introduce "prealloc-threads" property"
(v5.0.0) changed the default number of threads from number of CPUs
to 1.  This was deemed a regression, and fixed in commit f8d426a685
"hostmem: default the amount of prealloc-threads to smp-cpus".
Except the documentation remained unchanged.

Update 'qapi/qom.json' to reflect the change in commit f8d426a6852c.

Signed-off-by: Zhenyu Zhang 
---

v4: Fix the line exceeding the 80-character limit  (Gavin)
v3: Covers historical descriptions (Markus)
v2: The property is changed to smp-cpus since 5.0  (Phild)

---
 qapi/qom.json | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 30e76653ad..f4a7917f3d 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -576,7 +576,8 @@
 #
 # @prealloc: if true, preallocate memory (default: false)
 #
-# @prealloc-threads: number of CPU threads to use for prealloc (default: 1)
+# @prealloc-threads: number of CPU threads to use for prealloc
+#                    (default: number of CPUs) (since 5.0)
 #
 # @prealloc-context: thread context to use for creation of preallocation 
threads
 #(default: none) (since 7.2)
-- 
2.31.1




[PATCH 0/1] hw/arm/aspeed: Automatically zero-extend flash images

2022-11-14 Thread Peter Delevoryas
I've been using this patch for a long time so that I don't have to use
dd to zero-extend stuff all the time. It's just doing what people are
doing already, right? I hope it's not controversial.

One note: I couldn't figure out how to make it work without changing the
permissions on the block device to allow truncation. If somebody knows
how to avoid the `blk_get_perm`, `blk_set_perm` calls here, let me know!
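
For reference, the pattern I ended up with looks roughly like this (error
handling elided; rom_size is the length I want to extend the image to):

  uint64_t perm, shared_perm;

  blk_get_perm(blk, &perm, &shared_perm);
  blk_set_perm(blk, perm | BLK_PERM_RESIZE, shared_perm, &error_abort);
  blk_truncate(blk, rom_size, false, PREALLOC_MODE_OFF, 0, &error_abort);
  blk_set_perm(blk, perm, shared_perm, &error_abort);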

Thanks,
Peter

Peter Delevoryas (1):
  hw/arm/aspeed: Automatically zero-extend flash images

 hw/arm/aspeed.c | 40 +++-
 1 file changed, 39 insertions(+), 1 deletion(-)

-- 
2.38.1




Re: [PATCH] hw/intc: sifive_plic: Renumber the S irqs for numa support

2022-11-14 Thread Alistair Francis
On Fri, Nov 11, 2022 at 10:20 PM Frédéric Pétrot
 wrote:
>
> Commit 40244040 changed the way the S irqs are numbered. This breaks when
> using numa configuration, e.g.:
> ./qemu-system-riscv64 -nographic -machine virt,dumpdtb=numa-tree.dtb \
>   -m 2G -smp cpus=16 \
>   -object memory-backend-ram,id=mem0,size=512M \
>   -object memory-backend-ram,id=mem1,size=512M \
>   -object memory-backend-ram,id=mem2,size=512M \
>   -object memory-backend-ram,id=mem3,size=512M \
>   -numa node,cpus=0-3,memdev=mem0,nodeid=0 \
>   -numa node,cpus=4-7,memdev=mem1,nodeid=1 \
>   -numa node,cpus=8-11,memdev=mem2,nodeid=2 \
>   -numa node,cpus=12-15,memdev=mem3,nodeid=3
> leads to:
> Unexpected error in object_property_find_err() at ../qom/object.c:1304:
> qemu-system-riscv64: Property 'riscv.sifive.plic.unnamed-gpio-out[8]' not
> found
>
> This patch makes the numbering of the S irqs identical to what it was before.
>
> Signed-off-by: Frédéric Pétrot 

Thanks!

Applied to riscv-to-apply.next

Alistair

> ---
>  hw/intc/sifive_plic.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
> index c2dfacf028..89d2122742 100644
> --- a/hw/intc/sifive_plic.c
> +++ b/hw/intc/sifive_plic.c
> @@ -480,7 +480,7 @@ DeviceState *sifive_plic_create(hwaddr addr, char 
> *hart_config,
>qdev_get_gpio_in(DEVICE(cpu), IRQ_M_EXT));
>  }
>  if (plic->addr_config[i].mode == PLICMode_S) {
> -qdev_connect_gpio_out(dev, cpu_num,
> +qdev_connect_gpio_out(dev, cpu_num - plic->hartid_base,
>qdev_get_gpio_in(DEVICE(cpu), IRQ_S_EXT));
>  }
>  }
> --
> 2.37.2
>
>



[PATCH] docs/system/s390x: Document the "loadparm" machine property

2022-11-14 Thread Thomas Huth
The "loadparm" machine property is useful for selecting alternative
kernels on the disk of the guest, but so far we do not tell the users
yet how to use it. Add some documentation to fill this gap.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2128235
Signed-off-by: Thomas Huth 
---
 docs/system/s390x/bootdevices.rst | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/docs/system/s390x/bootdevices.rst 
b/docs/system/s390x/bootdevices.rst
index b5950133e8..40089c35a9 100644
--- a/docs/system/s390x/bootdevices.rst
+++ b/docs/system/s390x/bootdevices.rst
@@ -53,6 +53,32 @@ recommended to specify a CD-ROM device via ``-device 
scsi-cd`` (as mentioned
 above) instead.
 
 
+Selecting kernels with the ``loadparm`` property
+
+
+The ``s390-ccw-virtio`` machine supports the so-called ``loadparm`` parameter
+which can be used to select the kernel on the disk of the guest that the
+s390-ccw bios should boot. When starting QEMU, it can be specified like this::
+
+ qemu-system-s390x -machine s390-ccw-virtio,loadparm=<loadparm>
+
+The first way to use this parameter is to use the word ``PROMPT`` as the
+``<loadparm>`` here. In that case the s390-ccw bios will show a list of
+installed kernels on the disk of the guest and ask the user to enter a number
+to choose which kernel should be booted -- similar to what can be achieved by
+specifying the ``-boot menu=on`` option when starting QEMU. Note that the menu
+list will only show the names of the installed kernels when using a DASD-like
+disk image with 4k-byte sectors; on normal SCSI-style disks with 512-byte
+sectors, there is not enough space for the zipl loader on the disk to store
+the kernel names, so you only get a list without names here.
+
+The second way to use this parameter is to use a number in the range from 0
+to 31. The numbers that can be used here correspond to the numbers that are
+shown when using the ``PROMPT`` option, and the s390-ccw bios will then try
+to automatically boot the kernel that is associated with the given number.
+Note that ``0`` can be used to boot the default entry.
+
+
 Booting from a network device
 -
 
-- 
2.31.1




Re: [PATCH v2 02/12] tests/avocado: improve behaviour waiting for login prompts

2022-11-14 Thread Peter Maydell
On Fri, 11 Nov 2022 at 14:58, Alex Bennée  wrote:
>
> This attempts to deal with the problem of login prompts not being
> guaranteed to be terminated with a newline. The solution to this is to
> peek at the incoming data looking to see if we see an up-coming match
> before we fall back to the old readline() logic. The reason to mostly
> rely on readline is because I am occasionally seeing the peek stalling
> despite data being there.
>
> This seems kinda hacky and gross so I'm open to alternative approaches
> and cleaner python code.
>
> Signed-off-by: Alex Bennée 
> Reviewed-by: Philippe Mathieu-Daudé 

With this patch, the evb_sdk test fails:

Fetching asset from
./build/arm-clang/tests/avocado/machine_aspeed.py:AST2x00MachineSDK.test_arm_ast2500_evb_sdk
JOB ID : 542e050c4f7e1ddd6d5cdd350e4c26e1bdfcdee4
JOB LOG: 
/home/petmay01/avocado/job-results/job-2022-11-14T16.21-542e050/job.log
 (1/1) 
./build/arm-clang/tests/avocado/machine_aspeed.py:AST2x00MachineSDK.test_arm_ast2500_evb_sdk:
ERROR: log() missing 1 required positional argument: 'msg' (82.57 s)
RESULTS: PASS 0 | ERROR 1 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0
| CANCEL 0
JOB TIME   : 84.09 s

The avocado log reports a traceback where Python has thrown a
UnicodeDecodeError, and then subsequently an attempted debug
message in the error-handling path has a syntax error
("log() missing 1 required positional argument"):

2022-11-14 16:22:47,952 __init__ L0240 DEBUG| Stopping Host
logger for ttyVUART0...
2022-11-14 16:22:48,240 __init__ L0240 DEBUG| Starting Wait
for /xyz/ope…control/host0/boot/one_time...
2022-11-14 16:22:48,570 stacktrace   L0039 ERROR|
2022-11-14 16:22:48,570 stacktrace   L0041 ERROR| Reproduced
traceback from: /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm
-clang/tests/venv/lib/python3.8/site-packages/avocado/core/test.py:770
2022-11-14 16:22:48,572 stacktrace   L0045 ERROR| Traceback (most
recent call last):
2022-11-14 16:22:48,572 stacktrace   L0045 ERROR|   File
"/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/avocad
o/avocado_qemu/__init__.py", line 149, in _peek_ahead
2022-11-14 16:22:48,572 stacktrace   L0045 ERROR| peek_ahead =
console.peek(min_match).decode()
2022-11-14 16:22:48,572 stacktrace   L0045 ERROR|
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position
44-45: unexpec
ted end of data
2022-11-14 16:22:48,572 stacktrace   L0045 ERROR|
2022-11-14 16:22:48,572 stacktrace   L0045 ERROR| During handling
of the above exception, another exception occurred:
2022-11-14 16:22:48,572 stacktrace   L0045 ERROR|
2022-11-14 16:22:48,572 stacktrace   L0045 ERROR| Traceback (most
recent call last):
2022-11-14 16:22:48,572 stacktrace   L0045 ERROR|   File
"/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/venv/l
ib/python3.8/site-packages/avocado/core/decorators.py", line 90, in wrapper
2022-11-14 16:22:48,572 stacktrace   L0045 ERROR| return
function(obj, *args, **kwargs)
2022-11-14 16:22:48,572 stacktrace   L0045 ERROR|   File
"/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/avocad
o/machine_aspeed.py", line 225, in test_arm_ast2500_evb_sdk
2022-11-14 16:22:48,572 stacktrace   L0045 ERROR|
self.wait_for_console_pattern('ast2500-default login:')
2022-11-14 16:22:48,573 stacktrace   L0045 ERROR|   File
"/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/avocad
o/machine_aspeed.py", line 193, in wait_for_console_pattern
2022-11-14 16:22:48,573 stacktrace   L0045 ERROR|
wait_for_console_pattern(self, success_message,
2022-11-14 16:22:48,573 stacktrace   L0045 ERROR|   File
"/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/avocad
o/avocado_qemu/__init__.py", line 286, in wait_for_console_pattern
2022-11-14 16:22:48,573 stacktrace   L0045 ERROR|
_console_interaction(test, success_message, failure_message, None,
vm=vm)
2022-11-14 16:22:48,573 stacktrace   L0045 ERROR|   File
"/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/avocad
o/avocado_qemu/__init__.py", line 226, in _console_interaction
2022-11-14 16:22:48,573 stacktrace   L0045 ERROR| msg =
_peek_ahead(console, min_match, success_message)
2022-11-14 16:22:48,573 stacktrace   L0045 ERROR|   File
"/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/avocad
o/avocado_qemu/__init__.py", line 180, in _peek_ahead
2022-11-14 16:22:48,573 stacktrace   L0045 ERROR|
console_logger.log("error in decode of peek")
2022-11-14 16:22:48,573 stacktrace   L0045 ERROR| TypeError: log()
missing 1 required positional argument: 'msg'
2022-11-14 16:22:48,573 stacktrace   L0046 ERROR|
2022-11-14 16:22:48,573 test L0775 DEBUG| Local variables:
2022-11-14 16:22:48,605 test L0778 DEBUG|  -> obj :
1-./build/arm-clang/tests/avocado/machine_aspeed.py:AST2x00MachineSDK.test_arm_ast2500_evb_sdk
2022-11-14 16:22:48,605 test 
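
For what it's worth, both failures look addressable in _peek_ahead() itself;
something along these lines (a sketch, using the names from
avocado_qemu/__init__.py):

  # decode the peeked bytes without tripping over a multi-byte sequence
  # that got cut off at the end of the buffer, and give Logger.log() its
  # required level argument (or simply use .debug())
  try:
      peek_ahead = console.peek(min_match).decode(errors="backslashreplace")
  except OSError:
      console_logger.debug("error in peek of console")
      return None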

Re: [PATCH v2 14/15] migration: Remove old preempt code around state maintainance

2022-11-14 Thread Juan Quintela
Peter Xu  wrote:
> With the new code to send pages in the rp-return thread, there's little
> point in keeping lots of the old code for maintaining the preempt state in
> the migration thread, because the new way should always be faster.
>
> Then if we'll always send pages in the rp-return thread anyway, we don't
> need that logic to maintain the preempt state anymore because now we serialize
> things using the mutex directly instead of using those fields.
>
> It's very unfortunate to have those code for a short period, but that's
> still one intermediate step that we noticed the next bottleneck on the
> migration thread.  Now what we can do best is to drop unnecessary code as
> long as the new code is stable to reduce the burden.  It's actually a good
> thing because the new "sending page in rp-return thread" model is (IMHO)
> even cleaner and with better performance.
>
> Remove the old code that was responsible for maintaining preempt states, and
> in the meantime also remove the x-postcopy-preempt-break-huge parameter because
> with concurrent sender threads we don't really need to break-huge anymore.
>
> Reviewed-by: Dr. David Alan Gilbert 
> Signed-off-by: Peter Xu 

Reviewed-by: Juan Quintela 




[PULL 2/2] hw/intc/arm_gicv3: fix prio masking on pmr write

2022-11-14 Thread Peter Maydell
From: Jens Wiklander 

With commit 39f29e599355 ("hw/intc/arm_gicv3: Use correct number of
priority bits for the CPU") the number of priority bits was changed from
the maximum value 8 to typically 5. As a consequence a few of the lowest
bits in ICC_PMR_EL1 becomes RAZ/WI. However prior to this patch one of
these bits was still used since the supplied priority value is masked
before it's eventually right shifted with one bit. So the bit is not
lost as one might expect when the register is read again.

The Linux kernel depends on the lowest valid bit reading back as zero, see
commit 33625282adaa ("irqchip/gic-v3: Probe for SCR_EL3 being clear
before resetting AP0Rn") for details.

So fix this by masking the priority value after it may have been right
shifted by one bit.
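
For example, assuming 5 priority bits (icc_fullprio_mask() == 0xf8): a
non-secure write of 0x08 used to be masked to 0x08 and then shifted to
(0x08 >> 1) | 0x80 = 0x84, so bit 2 ended up stored even though it must be
RAZ/WI; with the mask applied after the shift, the stored value is 0x80,
as expected.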

Cc: qemu-sta...@nongnu.org
Fixes: 39f29e599355 ("hw/intc/arm_gicv3: Use correct number of priority bits 
for the CPU")
Signed-off-by: Jens Wiklander 
Reviewed-by: Peter Maydell 
Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_cpuif.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/intc/arm_gicv3_cpuif.c b/hw/intc/arm_gicv3_cpuif.c
index 8ca630e5ad1..b17b29288c7 100644
--- a/hw/intc/arm_gicv3_cpuif.c
+++ b/hw/intc/arm_gicv3_cpuif.c
@@ -1016,8 +1016,6 @@ static void icc_pmr_write(CPUARMState *env, const 
ARMCPRegInfo *ri,
 
 trace_gicv3_icc_pmr_write(gicv3_redist_affid(cs), value);
 
-value &= icc_fullprio_mask(cs);
-
 if (arm_feature(env, ARM_FEATURE_EL3) && !arm_is_secure(env) &&
 (env->cp15.scr_el3 & SCR_FIQ)) {
 /* NS access and Group 0 is inaccessible to NS: return the
@@ -1029,6 +1027,7 @@ static void icc_pmr_write(CPUARMState *env, const 
ARMCPRegInfo *ri,
 }
 value = (value >> 1) | 0x80;
 }
+value &= icc_fullprio_mask(cs);
 cs->icc_pmr_el1 = value;
 gicv3_cpuif_update(cs);
 }
-- 
2.25.1




Qemu virtual CPU limitation to 1024

2022-11-14 Thread Pawel Polawski
Hi Everyone,

I am trying to check qemu virtual cpu boundaries when running a custom
edk2 based firmware build. For that purpose I want to run qemu with more
than 1024 vCPU:
$QEMU
-accel kvm
-m 4G
-M q35,kernel-irqchip=on,smm=on
-smp cpus=1025,maxcpus=1025 -global mch.extended-tseg-mbytes=128
-drive if=pflash,format=raw,file=${CODE},readonly=on
-drive if=pflash,format=raw,file=${VARS}
-chardev stdio,id=fwlog
-device isa-debugcon,iobase=0x402,chardev=fwlog "$@"

The result is as follows:
QEMU emulator version 7.0.50 (v7.0.0-1651-g9cc1bf1ebc-dirty)
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
qemu-system-x86_64: -accel kvm: warning: Number of SMP cpus requested
(1025) exceeds the recommended cpus supported by KVM (8)
Number of SMP cpus requested (1025) exceeds the maximum cpus supported by
KVM (1024)

It is not clear to me if I am hitting a qemu limitation or a KVM limitation
here.
I have changed the hardcoded 1024 limits in hw/i386/* files but the limitation
is still present.
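
(If I read accel/kvm/kvm-all.c correctly, the hard limit is queried from the
kernel itself rather than hardcoded in hw/i386/*, roughly:

  static int kvm_max_vcpus(KVMState *s)
  {
      int ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS);

      return (ret) ? ret : kvm_recommended_vcpus(s);
  }

so the 1024 presumably comes from KVM_CAP_MAX_VCPUS on my host.)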

Can someone advise what I should debug next looking for those vCPU limits?

Best regards,
Pawel


-- 

Paweł Poławski

Red Hat Virtualization

ppola...@redhat.com




[PATCH 1/2] hw/xen/xen_pt: Call default handler only if no custom one is set

2022-11-14 Thread Marek Marczykowski-Górecki
Call pci_default_write_config() in xen_pt_pci_write_config() only for
registers that do not have custom handler, and do that only after
resolving them. This is important for two reasons:
1. XenPTRegInfo has ro_mask which needs to be enforced - Xen-specific
   hooks do that on their own (especially xen_pt_*_reg_write()).
2. Not setting the value early allows the hooks to see the old value too.

If it would be only about the first point, setting PCIDevice.wmask would
probably be sufficient, but given the second point, restructure those
writes.

Signed-off-by: Marek Marczykowski-Górecki 
---
 hw/xen/xen_pt.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
index 0ec7e52183..269bd26109 100644
--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -255,6 +255,7 @@ static void xen_pt_pci_write_config(PCIDevice *d, uint32_t 
addr,
 uint32_t find_addr = addr;
 XenPTRegInfo *reg = NULL;
 bool wp_flag = false;
+uint32_t emul_mask = 0, write_val;
 
 if (xen_pt_pci_config_access_check(d, addr, len)) {
 return;
@@ -310,7 +311,6 @@ static void xen_pt_pci_write_config(PCIDevice *d, uint32_t 
addr,
 }
 
 memory_region_transaction_begin();
-pci_default_write_config(d, addr, val, len);
 
 /* adjust the read and write value to appropriate CFC-CFF window */
 read_val <<= (addr & 3) << 3;
@@ -370,6 +370,8 @@ static void xen_pt_pci_write_config(PCIDevice *d, uint32_t 
addr,
 return;
 }
 
+emul_mask |= ( (1 << (reg->size * 8) ) - 1 ) << ((find_addr & 3) * 
8);
+
 /* calculate next address to find */
 emul_len -= reg->size;
 if (emul_len > 0) {
@@ -396,6 +398,24 @@ static void xen_pt_pci_write_config(PCIDevice *d, uint32_t 
addr,
 /* need to shift back before passing them to xen_host_pci_set_block. */
 val >>= (addr & 3) << 3;
 
+/* store emulated registers that didn't have specific hooks */
+write_val = val;
+for (index = 0; emul_mask; index += emul_len) {
+emul_len = 0;
+while (emul_mask & 0xff) {
+emul_len++;
+emul_mask >>= 8;
+}
+if (emul_len) {
+uint32_t mask = ((1 << (emul_len * 8)) - 1);
+pci_default_write_config(d, addr, write_val & mask, emul_len);
+write_val >>= emul_len * 8;
+} else {
+emul_mask >>= 8;
+write_val >>= 8;
+}
+}
+
 memory_region_transaction_commit();
 
 out:
-- 
2.37.3




Re: [PATCH 05/13] block: Inline bdrv_drain_invoke()

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

bdrv_drain_invoke() has now two entirely separate cases that share no
code any more and are selected depending on a bool parameter. Each case
has only one caller. Just inline the function.

Signed-off-by: Kevin Wolf 
---
  block/io.c | 23 ++-
  1 file changed, 6 insertions(+), 17 deletions(-)


Reviewed-by: Hanna Reitz 




Re: [PATCH v6 05/10] vdpa: move SVQ vring features check to net/

2022-11-14 Thread Jason Wang



在 2022/11/11 20:58, Eugenio Perez Martin 写道:

On Fri, Nov 11, 2022 at 9:07 AM Jason Wang  wrote:

On Fri, Nov 11, 2022 at 3:56 PM Eugenio Perez Martin
 wrote:

On Fri, Nov 11, 2022 at 8:34 AM Jason Wang  wrote:


在 2022/11/10 21:09, Eugenio Perez Martin 写道:

On Thu, Nov 10, 2022 at 6:40 AM Jason Wang  wrote:

在 2022/11/9 01:07, Eugenio Pérez 写道:

The next patches will start control SVQ if possible. However, we don't
know if that will be possible at qemu boot anymore.

If I am not wrong, there's no device-specific feature that is checked
in the function. So it should be general enough to be used by devices
other than net. Then I don't see any advantage of doing this.


Because vhost_vdpa_init_svq is called at qemu boot, failing if it is
not possible to shadow the Virtqueue.

Now the CVQ will be shadowed if possible, so we need to check this at
device start, not at initialization.


Any reason we can't check this at device start? We don't need
driver_features and we can do any probing to make sure cvq has a unique
group during initialization time.


We need the CVQ index to check if it has an independent group. CVQ
index depends on the features the guest's ack:
* If it acks _F_MQ, it is the last one.
* If it doesn't, CVQ idx is 2.

We cannot have acked features at initialization, and they could
change: It is valid for a guest to ack _F_MQ, then reset the device,
then not ack it.

Can we do some probing by negotiating _F_MQ if the device offers it,
then we can know if cvq has a unique group?


What if the guest does not ack _F_MQ?

To be completed it would go like:

* Probe negotiate _F_MQ, check unique group,
* Probe negotiate !_F_MQ, check unique group,



I think it should be a bug if a device presents a unique virtqueue group
that depends on a specific feature. That is to say, we can do a single
round of probing instead of trying it twice here.




* Actually negotiate with the guest's feature set.
* React to failures. Probably the same way as if the CVQ is not
isolated, disabling SVQ?

To me it seems simpler to specify somehow that the vq must be independent.



It's just a suggestion; if you think doing it at start time is fine, I'm fine.
But we need to document the reason with a comment, maybe.
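
E.g. a single probing round at initialization could look like this (a rough
sketch using VHOST_VDPA_GET_VRING_GROUP from <linux/vhost.h>; CVQ index
discovery and error handling elided):

  static int64_t vhost_vdpa_get_vring_group(int device_fd, unsigned vq_index)
  {
      struct vhost_vring_state state = { .index = vq_index };

      if (ioctl(device_fd, VHOST_VDPA_GET_VRING_GROUP, &state) < 0) {
          return -errno;
      }
      /* group id; compare the CVQ's group against the data vqs' group */
      return state.num;
  }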


Thanks




Thanks!


   To store this information at boot
time is not valid anymore, because v->shadow_vqs_enabled is not valid
at this time anymore.


Ok, but this doesn't explain why it is net specific rather than vhost-vdpa specific.


We can try to move it to a vhost op, but we have the same problem as
the svq array allocation: We don't have the right place in vhost ops
to check this. Maybe vhost_set_features is the right one here?

If we can do all the probing at the initialization phase, we can do
everything there.

Thanks


Thanks!


Thanks



Thanks!


Thanks



Since the moved checks will be already evaluated at net/ to know if it
is ok to shadow CVQ, move them.

Signed-off-by: Eugenio Pérez 
---
hw/virtio/vhost-vdpa.c | 33 ++---
net/vhost-vdpa.c   |  3 ++-
2 files changed, 4 insertions(+), 32 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 3df2775760..146f0dcb40 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -402,29 +402,9 @@ static int vhost_vdpa_get_dev_features(struct vhost_dev 
*dev,
return ret;
}

-static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
-   Error **errp)
+static void vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v)
{
g_autoptr(GPtrArray) shadow_vqs = NULL;
-uint64_t dev_features, svq_features;
-int r;
-bool ok;
-
-if (!v->shadow_vqs_enabled) {
-return 0;
-}
-
-r = vhost_vdpa_get_dev_features(hdev, &dev_features);
-if (r != 0) {
-error_setg_errno(errp, -r, "Can't get vdpa device features");
-return r;
-}
-
-svq_features = dev_features;
-ok = vhost_svq_valid_features(svq_features, errp);
-if (unlikely(!ok)) {
-return -1;
-}

shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
for (unsigned n = 0; n < hdev->nvqs; ++n) {
@@ -436,7 +416,6 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, 
struct vhost_vdpa *v,
}

-v->shadow_vqs = g_steal_pointer(&shadow_vqs);
-return 0;
}

static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error 
**errp)
@@ -461,11 +440,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void 
*opaque, Error **errp)
dev->opaque =  opaque ;
v->listener = vhost_vdpa_memory_listener;
v->msg_type = VHOST_IOTLB_MSG_V2;
-ret = vhost_vdpa_init_svq(dev, v, errp);
-if (ret) {
-goto err;
-}
-
+vhost_vdpa_init_svq(dev, v);
vhost_vdpa_get_iova_range(v);

if (!vhost_vdpa_first_dev(dev)) {
@@ -476,10 +451,6 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void 
*opaque, Error **errp)

Re: [PATCH 10/13] block: Call drain callbacks only once

2022-11-14 Thread Kevin Wolf
Am 09.11.2022 um 19:05 hat Vladimir Sementsov-Ogievskiy geschrieben:
> On 11/8/22 15:37, Kevin Wolf wrote:
> > We only need to call both the BlockDriver's callback and the parent
> > callbacks when going from undrained to drained or vice versa. A second
> > drain section doesn't make a difference for the driver or the parent,
> > they weren't supposed to send new requests before and after the second
> > drain.
> > 
> > One thing that gets in the way is the 'ignore_bds_parents' parameter in
> > bdrv_do_drained_begin_quiesce() and bdrv_do_drained_end(): If it is true
> > for the first drain, bs->quiesce_counter will be non-zero, but the
> > parent callbacks still haven't been called, so a second drain where it
> > is false would still have to call them.
> > 
> > Instead of keeping track of this, let's just get rid of the parameter.
> > It was introduced in commit 6cd5c9d7b2d as an optimisation so that
> > during bdrv_drain_all(), we wouldn't recursively drain all parents up to
> > the root for each node, resulting in quadratic complexity. As it happens,
> > calling the callbacks only once solves the same problem, so as of this
> > patch, we'll still have O(n) complexity and ignore_bds_parents is not
> > needed any more.
> > 
> > This patch only ignores the 'ignore_bds_parents' parameter. It will be
> > removed in a separate patch.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >   block.c  | 13 ++---
> >   block/io.c   | 24 +---
> >   tests/unit/test-bdrv-drain.c | 16 ++--
> >   3 files changed, 29 insertions(+), 24 deletions(-)
> > 
> > diff --git a/block.c b/block.c
> > index 9d082631d9..8878586f6e 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -2816,7 +2816,6 @@ static void bdrv_replace_child_noperm(BdrvChild 
> > *child,
> >   {
> >   BlockDriverState *old_bs = child->bs;
> >   int new_bs_quiesce_counter;
> > -int drain_saldo;
> >   assert(!child->frozen);
> >   assert(old_bs != new_bs);
> > @@ -2827,15 +2826,13 @@ static void bdrv_replace_child_noperm(BdrvChild 
> > *child,
> >   }
> >   new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
> > -drain_saldo = new_bs_quiesce_counter - child->parent_quiesce_counter;
> >   /*
> >* If the new child node is drained but the old one was not, flush
> >* all outstanding requests to the old child node.
> >*/
> > -while (drain_saldo > 0 && child->klass->drained_begin) {
> > +if (new_bs_quiesce_counter && !child->parent_quiesce_counter) {
> 
> Looks like checking for child->klass->drained_begin was a wrong thing
> even pre-patch?

I'm not sure if it was strictly wrong in practice, but at least
unnecessary. It would have been wrong if a BdrvChildClass implemented
for example .drained_begin, but not .drained_end. But I think we always
implement all three of .drained_begin/poll/end or none of them.

> Also, parent_quiesce_counter actually becomes a boolean variable.
> Should we stress that with a new type and name?

Ok, but I would do that in a separate patch. Maybe 'bool drains_parent'.

> >   bdrv_parent_drained_begin_single(child, true);
> > -drain_saldo--;
> >   }
> >   if (old_bs) {
> > @@ -2859,7 +2856,6 @@ static void bdrv_replace_child_noperm(BdrvChild 
> > *child,
> >* more often.
> >*/
> 
> the comment above ^^^ should be updated, we are not going to call
> drained_end more than once anyway
> 
> >   assert(new_bs->quiesce_counter <= new_bs_quiesce_counter);
> 
> do we still need this assertion and the comment at all?

Patch 12 removes both, but I can do it already here.

> > -drain_saldo += new_bs->quiesce_counter - new_bs_quiesce_counter;
> >   if (child->klass->attach) {
> >   child->klass->attach(child);
> > @@ -2869,10 +2865,13 @@ static void bdrv_replace_child_noperm(BdrvChild 
> > *child,
> >   /*
> >* If the old child node was drained but the new one is not, allow
> >* requests to come in only after the new node has been attached.
> > + *
> > + * Update new_bs_quiesce_counter because 
> > bdrv_parent_drained_begin_single()
> > + * polls, which could have changed the value.
> >*/
> > -while (drain_saldo < 0 && child->klass->drained_end) {
> > +new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
> > +if (!new_bs_quiesce_counter && child->parent_quiesce_counter) {
> >   bdrv_parent_drained_end_single(child);
> > -drain_saldo++;
> >   }
> >   }
> > diff --git a/block/io.c b/block/io.c
> > index 870a25d7a5..87c7a92f15 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -62,7 +62,7 @@ void bdrv_parent_drained_end_single(BdrvChild *c)
> >   {
> >   IO_OR_GS_CODE();
> > -assert(c->parent_quiesce_counter > 0);
> > +assert(c->parent_quiesce_counter == 1);
> >   c->parent_quiesce_counter--;
> >   if (c->klass->drained_end) {

Re: [PATCH] hw/sd: Fix sun4i allwinner-sdhost for U-Boot

2022-11-14 Thread Peter Maydell
On Mon, 14 Nov 2022 at 17:29, Strahinja Jankovic
 wrote:
>
> Hi,
>
> Thank you for your reply.
>
> On Mon, Nov 14, 2022 at 4:42 PM Peter Maydell  
> wrote:
> >
> > On Sat, 12 Nov 2022 at 21:49, Strahinja Jankovic
> >  wrote:
> > >
> > > Trying to run U-Boot for Cubieboard (Allwinner A10) fails because it
> > > cannot access the SD card. The problem is that the FIFO register in the
> > > current allwinner-sdhost implementation is at the address corresponding
> > > to the Allwinner H3, but not the A10.
> > > Linux kernel is not affected since Linux driver uses DMA access and does
> > > not use FIFO register for reading/writing.
> > >
> > > This patch adds a new class parameter `is_sun4i`, and based on that
> > > parameter uses the register at offset 0x100 either as the FIFO register
> > > (if sun4i) or as the threshold register (if not sun4i; in that case the
> > > register at 0x200 is the FIFO register).
> > >
> > > Tested with U-Boot and Linux kernel image built for Cubieboard and
> > > OrangePi PC.
> > >
> > > Signed-off-by: Strahinja Jankovic 
> > > ---
> > >  hw/sd/allwinner-sdhost.c | 67 ++--
> > >  include/hw/sd/allwinner-sdhost.h |  1 +
> > >  2 files changed, 47 insertions(+), 21 deletions(-)
> > >
> > > diff --git a/hw/sd/allwinner-sdhost.c b/hw/sd/allwinner-sdhost.c
> > > index 455d6eabf6..51e5e90830 100644
> > > --- a/hw/sd/allwinner-sdhost.c
> > > +++ b/hw/sd/allwinner-sdhost.c
> > > @@ -65,7 +65,7 @@ enum {
> > >  REG_SD_DLBA   = 0x84,  /* Descriptor List Base Address */
> > >  REG_SD_IDST   = 0x88,  /* Internal DMA Controller Status */
> > >  REG_SD_IDIE   = 0x8C,  /* Internal DMA Controller IRQ Enable */
> > > -REG_SD_THLDC  = 0x100, /* Card Threshold Control */
> > > +REG_SD_THLDC  = 0x100, /* Card Threshold Control / FIFO (sun4i 
> > > only)*/
> > >  REG_SD_DSBD   = 0x10C, /* eMMC DDR Start Bit Detection Control */
> > >  REG_SD_RES_CRC= 0x110, /* Response CRC from card/eMMC */
> > >  REG_SD_DATA7_CRC  = 0x114, /* CRC Data 7 from card/eMMC */
> > > @@ -415,10 +415,29 @@ static void allwinner_sdhost_dma(AwSdHostState *s)
> > >  }
> > >  }
> > >
> > > +static uint32_t allwinner_sdhost_fifo_read(AwSdHostState *s)
> > > +{
> > > +uint32_t res = 0;
> > > +
> > > +if (sdbus_data_ready(&s->sdbus)) {
> > > +sdbus_read_data(&s->sdbus, &res, sizeof(uint32_t));
> > > +le32_to_cpus(&res);
> > > +allwinner_sdhost_update_transfer_cnt(s, sizeof(uint32_t));
> > > +allwinner_sdhost_auto_stop(s);
> > > +allwinner_sdhost_update_irq(s);
> > > +} else {
> > > +qemu_log_mask(LOG_GUEST_ERROR, "%s: no data ready on SD bus\n",
> > > +  __func__);
> > > +}
> > > +
> > > +return res;
> > > +}
> > > +
> > >  static uint64_t allwinner_sdhost_read(void *opaque, hwaddr offset,
> > >unsigned size)
> > >  {
> > >  AwSdHostState *s = AW_SDHOST(opaque);
> > > +AwSdHostClass *sc = AW_SDHOST_GET_CLASS(s);
> > >  uint32_t res = 0;
> > >
> > >  switch (offset) {
> > > @@ -508,8 +527,12 @@ static uint64_t allwinner_sdhost_read(void *opaque, 
> > > hwaddr offset,
> > >  case REG_SD_IDIE:  /* Internal DMA Controller Interrupt Enable */
> > >  res = s->dmac_irq;
> > >  break;
> > > -case REG_SD_THLDC: /* Card Threshold Control */
> > > -res = s->card_threshold;
> > > +case REG_SD_THLDC: /* Card Threshold Control or FIFO register 
> > > (sun4i) */
> > > +if (sc->is_sun4i) {
> > > +res = allwinner_sdhost_fifo_read(s);
> > > +} else {
> > > +res = s->card_threshold;
> > > +}
> > >  break;
> > >  case REG_SD_DSBD:  /* eMMC DDR Start Bit Detection Control */
> > >  res = s->startbit_detect;
> > > @@ -531,16 +554,7 @@ static uint64_t allwinner_sdhost_read(void *opaque, 
> > > hwaddr offset,
> > >  res = s->status_crc;
> > >  break;
> > >  case REG_SD_FIFO:  /* Read/Write FIFO */
> > > -if (sdbus_data_ready(&s->sdbus)) {
> > > -sdbus_read_data(&s->sdbus, &res, sizeof(uint32_t));
> > > -le32_to_cpus(&res);
> > > -allwinner_sdhost_update_transfer_cnt(s, sizeof(uint32_t));
> > > -allwinner_sdhost_auto_stop(s);
> > > -allwinner_sdhost_update_irq(s);
> > > -} else {
> > > -qemu_log_mask(LOG_GUEST_ERROR, "%s: no data ready on SD 
> > > bus\n",
> > > -  __func__);
> > > -}
> > > +res = allwinner_sdhost_fifo_read(s);
> >
> > Does the sun4i really have the FIFO at both addresses, or should
> > this one do something else for sun4i ?
>
> The sun4i sdhost actually has no registers with offset higher than
> 0x100 (offset of REG_SD_THLDC in patch), so REG_SD_DSBD, all
> REG_SD_*_CRC, REG_SD_CRC_STA and REG_SD_FIFO@0x200 should not be
> accessed from application code meant to run on sun4i. That is why I
> only 

Re: [PATCH] migration: check magic value for deciding the mapping of channels

2022-11-14 Thread Peter Xu
Manish,

On Thu, Nov 03, 2022 at 11:47:51PM +0530, manish.mishra wrote:
> Yes, but if we try to read early on the main channel with TLS enabled, it is
> an issue. Sorry, I may not have put the above comment clearly. I will try to
> lay the scenario out step by step.
> 1. The main channel is created and the TLS handshake is done for it.
> 2. The destination side tries to read the magic early on the main channel in
> migration_ioc_process_incoming, but it has not yet been sent by the source.
> 3. The source has written the magic to the main channel's file buffer, but it
> is not yet flushed; it is flushed for the first time in ram_save_setup. I
> mean, data is sent on the channel only if the qemu file buffer is full or
> explicitly flushed.
> 4. The source side blocks on multifd_send_sync_main in ram_save_setup before
> flushing the qemu file. But multifd_send_sync_main is blocked on sem_sync
> until the handshake is done for the multiFD channels.
> 5. The destination side is still waiting to read the magic on the main
> channel, so unless we return from migration_ioc_process_incoming we cannot
> accept a new channel, so the handshake of the multiFD channels is blocked.
> 6. So basically the source is blocked on the multiFD channels' handshake
> before sending data on the main channel, but the destination is blocked
> waiting for data before it can acknowledge the multiFD channels and do the
> handshake, so it creates a deadlock.

Why is this issue only happening with TLS?  It sounds like it'll happen as
long as multifd is enabled.

I'm also thinking whether we should flush in qemu_savevm_state_header(), so
that at least an upgraded src qemu will always flush the headers, if that
never hurts.
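
Something as simple as this might do (untested, from memory):

  void qemu_savevm_state_header(QEMUFile *f)
  {
      trace_savevm_state_header();
      qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
      qemu_put_be32(f, QEMU_VM_FILE_VERSION);

      /* ... configuration section, if any ... */

      qemu_fflush(f);  /* push the magic onto the wire before anyone blocks */
  }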

-- 
Peter Xu





[libnbd PATCH v2 16/23] examples: Update copy-libev to use 64-bit block status

2022-11-14 Thread Eric Blake
Although our use of "base:allocation" doesn't require the use of the
64-bit API for flags, we might perform slightly faster for a server
that does give us 64-bit extent lengths and honors larger nbd_zero
lengths.
---
 examples/copy-libev.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/examples/copy-libev.c b/examples/copy-libev.c
index 418d99f1..d8b45d87 100644
--- a/examples/copy-libev.c
+++ b/examples/copy-libev.c
@@ -94,7 +94,7 @@ struct request {
 };

 struct extent {
-uint32_t length;
+uint64_t length;
 bool zero;
 };

@@ -182,7 +182,7 @@ get_events(struct connection *c)

 static int
 extent_callback (void *user_data, const char *metacontext, uint64_t offset,
- uint32_t *entries, size_t nr_entries, int *error)
+ nbd_extent *entries, size_t nr_entries, int *error)
 {
 struct request *r = user_data;

@@ -197,22 +197,21 @@ extent_callback (void *user_data, const char 
*metacontext, uint64_t offset,
 return 1;
 }

-/* Libnbd returns uint32_t pair (length, flags) for each extent. */
-extents_len = nr_entries / 2;
+extents_len = nr_entries;

 extents = malloc (extents_len * sizeof *extents);
 if (extents == NULL)
 FAIL ("Cannot allocated extents: %s", strerror (errno));

 /* Copy libnbd entries to extents array. */
-for (int i = 0, j = 0; i < extents_len; i++, j=i*2) {
-extents[i].length = entries[j];
+for (int i = 0; i < extents_len; i++) {
+extents[i].length = entries[i].length;

 /* Libnbd exposes both ZERO and HOLE flags. We care only about
  * ZERO status, meaning we can copy this extent using efficinet
  * zero method.
  */
-extents[i].zero = (entries[j + 1] & LIBNBD_STATE_ZERO) != 0;
+extents[i].zero = (entries[i].flags & LIBNBD_STATE_ZERO) != 0;
 }

 DEBUG ("r%zu: received %zu extents for %s",
@@ -284,10 +283,10 @@ start_extents (struct request *r)
 DEBUG ("r%zu: start extents offset=%" PRIi64 " count=%zu",
r->index, offset, count);

-cookie = nbd_aio_block_status (
+cookie = nbd_aio_block_status_64 (
 src.nbd, count, offset,
-(nbd_extent_callback) { .callback=extent_callback,
-.user_data=r },
+(nbd_extent64_callback) { .callback=extent_callback,
+  .user_data=r },
 (nbd_completion_callback) { .callback=extents_completed,
 .user_data=r },
 0);
@@ -322,7 +321,7 @@ next_extent (struct request *r)
 limit = MIN (REQUEST_SIZE, size - offset);

 while (length < limit) {
-DEBUG ("e%zu: offset=%" PRIi64 " len=%" PRIu32 " zero=%d",
+DEBUG ("e%zu: offset=%" PRIi64 " len=%" PRIu64 " zero=%d",
extents_pos, offset, extents[extents_pos].length, is_zero);

 /* If this extent is too large, steal some data from it to
-- 
2.38.1




[PATCH v2 4/6] spec: Allow 64-bit block status results

2022-11-14 Thread Eric Blake
There are some potential extension metadata contexts that would
benefit from a 64-bit status value.  For example, Zoned Block Devices
(see https://zonedstorage.io/docs/linux/zbd-api) may want to return
the relative offset of where the next write will occur within the
zone, where a zone may be larger than 4G; creating a metacontext
"zbd:offset" that returns a 64-bit offset seems nicer than creating
two metacontexts "zbd:offset_lo" and "zbd:offset_hi" that each return
only 32 bits of the answer.
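
(A client reassembling the value from two 32-bit contexts would need
something along the lines of

    uint64_t zone_offset = ((uint64_t)offset_hi << 32) | offset_lo;

with offset_hi/offset_lo being the two status values, whereas a single
64-bit context hands the value back directly.)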

While the addition of extended headers superficially justified leaving
room in NBD_REPLY_TYPE_BLOCK_STATUS_EXT for the purpose of alignment,
it also has the nice benefit of being useful to allow extension
metadata contexts that can actually take advantage of the padding (and
remembering that since network byte order is big-endian, the padding
is in the correct location).  To ensure maximum backwards
compatibility, require that all contexts in the "base:" namespace (so
far, just "base:allocation") will only utilize 32-bit status.
---
 doc/proto.md | 62 +---
 1 file changed, 49 insertions(+), 13 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index fde1e70..14af48d 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -987,7 +987,10 @@ The procedure works as follows:
   during transmission, the client MUST select one or more metadata
   contexts with the `NBD_OPT_SET_META_CONTEXT` command. If needed, the
   client can use `NBD_OPT_LIST_META_CONTEXT` to list contexts that the
-  server supports.
+  server supports.  Most metadata contexts expose no more than 32 bits
+  of information, but some metadata contexts have associated data that
+  is 64 bits in length; using such contexts requires the client to
+  first negotiate extended headers with `NBD_OPT_EXTENDED_HEADERS`.
 - During transmission, a client can then indicate interest in metadata
   for a given region by way of the `NBD_CMD_BLOCK_STATUS` command,
   where *offset* and *length* indicate the area of interest. On
@@ -1045,7 +1048,7 @@ third-party namespaces are currently registered:
 Save in respect of the `base:` namespace described below, this specification
 requires no specific semantics of metadata contexts, except that all the
 information they provide MUST be representable within the flags field as
-defined for `NBD_REPLY_TYPE_BLOCK_STATUS`. Likewise, save in respect of
+defined for `NBD_REPLY_TYPE_BLOCK_STATUS_EXT`. Likewise, save in respect of
 the `base:` namespace, the syntax of query strings is not specified by this
 document, other than the recommendation that the empty leaf-name makes
 sense as a wildcard for a client query during `NBD_OPT_LIST_META_CONTEXT`,
@@ -1112,7 +1115,9 @@ should make no assumption as to its contents or stability.

 For the `base:allocation` context, the remainder of the flags field is
 reserved. Servers SHOULD set it to all-zero; clients MUST ignore
-unknown flags.
+unknown flags.  Because fewer than 32 flags are defined, this metadata
+context does not require the use of `NBD_OPT_EXTENDED_HEADERS`, and a
+server can use `NBD_REPLY_TYPE_BLOCK_STATUS` to return results.

 ## Values

@@ -1480,6 +1485,18 @@ of the newstyle negotiation.
 to do so, a server MAY send `NBD_REP_ERR_INVALID` or
 `NBD_REP_ERR_EXT_HEADER_REQD`.

+A server MAY support extension contexts that produce status values
+that require more than 32 bits.  The server MAY advertise such
+contexts even if the client has not yet negotiated extended
+headers, although it SHOULD then conclude the overall response
+with the `NBD_REP_ERR_EXT_HEADER_REQD` error to inform the client
+that extended headers are required to make full use of all
+contexts advertised.  However, since none of the contexts defined
+in the "base:" namespace provide more than 32 bits of status, a
+server MUST NOT use this failure mode when the response is limited
+to the "base:" namespace; nor may the server use this failure mode
+when the client has already negotiated extended headers.
+
 Data:
 - 32 bits, length of export name.  
 - String, name of export for which we wish to list metadata
@@ -1565,6 +1582,13 @@ of the newstyle negotiation.
 to do so, a server SHOULD send `NBD_REP_ERR_INVALID` or
 `NBD_REP_ERR_EXT_HEADER_REQD`.

+If a client requests a metadata context that utilizes 64-bit
+status, but has not yet negotiated extended headers, the server
+MUST either omit that context from its successful reply, or else
+fail the request with `NBD_REP_ERR_EXT_HEADER_REQD`.  The server
+MUST NOT use this failure for a client request that is limited to
+contexts in the "base:" namespace.
+
 A client MUST NOT send `NBD_CMD_BLOCK_STATUS` unless within the
 negotiation phase it sent `NBD_OPT_SET_META_CONTEXT` at least
 once, and where the final time it was sent, it referred to the
@@ -2028,16 +2052,23 @@ size.
   extent information at the first 

[PATCH v2 04/15] nbd: Add types for extended headers

2022-11-14 Thread Eric Blake
Add the constants and structs necessary for later patches to start
implementing the NBD_OPT_EXTENDED_HEADERS extension in both the client
and server, matching recent commit XXX[*] in the upstream nbd project.
This patch does not change any existing behavior, but merely sets the
stage.

This patch does not change the status quo that neither the client nor
server use a packed-struct representation for the request header.

Signed-off-by: Eric Blake 

---
[*]tweak commit message once nbd commit id available
---
 docs/interop/nbd.txt |  1 +
 include/block/nbd.h  | 74 
 nbd/common.c | 10 +-
 3 files changed, 65 insertions(+), 20 deletions(-)

diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt
index f5ca25174a..988c072697 100644
--- a/docs/interop/nbd.txt
+++ b/docs/interop/nbd.txt
@@ -69,3 +69,4 @@ NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE
 NBD_CMD_FLAG_FAST_ZERO
 * 5.2: NBD_CMD_BLOCK_STATUS for "qemu:allocation-depth"
 * 7.1: NBD_FLAG_CAN_MULTI_CONN for shareable writable exports
+* 7.2: NBD_OPT_EXTENDED_HEADERS
diff --git a/include/block/nbd.h b/include/block/nbd.h
index e357452a57..357121ce76 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -78,13 +78,24 @@ typedef struct NBDStructuredReplyChunk {
 uint32_t length; /* length of payload */
 } QEMU_PACKED NBDStructuredReplyChunk;

+typedef struct NBDExtendedReplyChunk {
+uint32_t magic;  /* NBD_EXTENDED_REPLY_MAGIC */
+uint16_t flags;  /* combination of NBD_REPLY_FLAG_* */
+uint16_t type;   /* NBD_REPLY_TYPE_* */
+uint64_t handle; /* request handle */
+uint64_t offset; /* request offset */
+uint64_t length; /* length of payload */
+} QEMU_PACKED NBDExtendedReplyChunk;
+
 typedef union NBDReply {
 NBDSimpleReply simple;
 NBDStructuredReplyChunk structured;
+NBDExtendedReplyChunk extended;
 struct {
-/* @magic and @handle fields have the same offset and size both in
- * simple reply and structured reply chunk, so let them be accessible
- * without ".simple." or ".structured." specification
+/*
+ * @magic and @handle fields have the same offset and size in all
+ * forms of replies, so let them be accessible without ".simple.",
+ * ".structured.", or ".extended." specifications.
  */
 uint32_t magic;
 uint32_t _skip;
@@ -117,15 +128,29 @@ typedef struct NBDStructuredError {
 typedef struct NBDStructuredMeta {
 /* header's length >= 12 (at least one extent) */
 uint32_t context_id;
-/* extents follows */
+/* NBDExtent extents[] follows, array length implied by header */
 } QEMU_PACKED NBDStructuredMeta;

-/* Extent chunk for NBD_REPLY_TYPE_BLOCK_STATUS */
+/* Extent array for NBD_REPLY_TYPE_BLOCK_STATUS */
 typedef struct NBDExtent {
 uint32_t length;
 uint32_t flags; /* NBD_STATE_* */
 } QEMU_PACKED NBDExtent;

+/* Header of NBD_REPLY_TYPE_BLOCK_STATUS_EXT */
+typedef struct NBDStructuredMetaExt {
+/* header's length >= 24 (at least one extent) */
+uint32_t context_id;
+uint32_t count; /* header length must be count * 16 + 8 */
+/* NBDExtentExt extents[count] follows */
+} QEMU_PACKED NBDStructuredMetaExt;
+
+/* Extent array for NBD_REPLY_TYPE_BLOCK_STATUS_EXT */
+typedef struct NBDExtentExt {
+uint64_t length;
+uint64_t flags; /* NBD_STATE_* */
+} QEMU_PACKED NBDExtentExt;
+
 /* Transmission (export) flags: sent from server to client during handshake,
but describe what will happen during transmission */
 enum {
@@ -178,6 +203,7 @@ enum {
 #define NBD_OPT_STRUCTURED_REPLY  (8)
 #define NBD_OPT_LIST_META_CONTEXT (9)
 #define NBD_OPT_SET_META_CONTEXT  (10)
+#define NBD_OPT_EXTENDED_HEADERS  (11)

 /* Option reply types. */
 #define NBD_REP_ERR(value) ((UINT32_C(1) << 31) | (value))
@@ -195,6 +221,8 @@ enum {
 #define NBD_REP_ERR_UNKNOWN NBD_REP_ERR(6)  /* Export unknown */
 #define NBD_REP_ERR_SHUTDOWNNBD_REP_ERR(7)  /* Server shutting down */
 #define NBD_REP_ERR_BLOCK_SIZE_REQD NBD_REP_ERR(8)  /* Need INFO_BLOCK_SIZE */
+#define NBD_REP_ERR_TOO_BIG NBD_REP_ERR(9)  /* Payload size overflow */
+#define NBD_REP_ERR_EXT_HEADER_REQD NBD_REP_ERR(10) /* Need extended headers */

 /* Info types, used during NBD_REP_INFO */
 #define NBD_INFO_EXPORT 0
@@ -203,12 +231,14 @@ enum {
 #define NBD_INFO_BLOCK_SIZE 3

 /* Request flags, sent from client to server during transmission phase */
-#define NBD_CMD_FLAG_FUA(1 << 0) /* 'force unit access' during write */
-#define NBD_CMD_FLAG_NO_HOLE(1 << 1) /* don't punch hole on zero run */
-#define NBD_CMD_FLAG_DF (1 << 2) /* don't fragment structured read */
-#define NBD_CMD_FLAG_REQ_ONE(1 << 3) /* only one extent in BLOCK_STATUS
-  * reply chunk */
-#define NBD_CMD_FLAG_FAST_ZERO  (1 << 4) /* fail if WRITE_ZEROES is not fast */
+#define NBD_CMD_FLAG_FUA (1 << 0) 

Re: [PATCH] hw/intc: sifive_plic: Renumber the S irqs for numa support

2022-11-14 Thread Philippe Mathieu-Daudé

On 11/11/22 13:19, Frédéric Pétrot wrote:

Commit 40244040 changed the way the S irqs are numbered. This breaks when


40244040a7, perhaps?


using numa configuration, e.g.:
./qemu-system-riscv64 -nographic -machine virt,dumpdtb=numa-tree.dtb \
   -m 2G -smp cpus=16 \
  -object memory-backend-ram,id=mem0,size=512M \
  -object memory-backend-ram,id=mem1,size=512M \
  -object memory-backend-ram,id=mem2,size=512M \
  -object memory-backend-ram,id=mem3,size=512M \
  -numa node,cpus=0-3,memdev=mem0,nodeid=0 \
  -numa node,cpus=4-7,memdev=mem1,nodeid=1 \
  -numa node,cpus=8-11,memdev=mem2,nodeid=2 \
  -numa node,cpus=12-15,memdev=mem3,nodeid=3
leads to:
Unexpected error in object_property_find_err() at ../qom/object.c:1304:
qemu-system-riscv64: Property 'riscv.sifive.plic.unnamed-gpio-out[8]' not
found

> This patch makes the numbering of the S irqs identical to what it was before.

Signed-off-by: Frédéric Pétrot 
---
  hw/intc/sifive_plic.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index c2dfacf028..89d2122742 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -480,7 +480,7 @@ DeviceState *sifive_plic_create(hwaddr addr, char 
*hart_config,
qdev_get_gpio_in(DEVICE(cpu), IRQ_M_EXT));
  }
  if (plic->addr_config[i].mode == PLICMode_S) {
-qdev_connect_gpio_out(dev, cpu_num,
+qdev_connect_gpio_out(dev, cpu_num - plic->hartid_base,
qdev_get_gpio_in(DEVICE(cpu), IRQ_S_EXT));
  }
  }


Oops.

Eventually we could unify the style:

-- >8 --
@@ -476,11 +476,11 @@ DeviceState *sifive_plic_create(hwaddr addr, char *hart_config,

 CPUState *cpu = qemu_get_cpu(cpu_num);

 if (plic->addr_config[i].mode == PLICMode_M) {
-qdev_connect_gpio_out(dev, num_harts - plic->hartid_base + cpu_num,
+qdev_connect_gpio_out(dev, cpu_num - hartid_base + num_harts,
   qdev_get_gpio_in(DEVICE(cpu), IRQ_M_EXT));
 }
 if (plic->addr_config[i].mode == PLICMode_S) {
-qdev_connect_gpio_out(dev, cpu_num,
+qdev_connect_gpio_out(dev, cpu_num - hartid_base,
   qdev_get_gpio_in(DEVICE(cpu), IRQ_S_EXT));
 }
 }
---

Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH v7 11/12] multifd: Zero pages transmission

2022-11-14 Thread Juan Quintela
Leonardo Brás  wrote:
> On Tue, 2022-08-02 at 08:39 +0200, Juan Quintela wrote:
>> This implements the zero page dection and handling.
>> 
>> Signed-off-by: Juan Quintela 

>> @@ -358,6 +365,18 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams 
>> *p, Error **errp)
>>  p->normal[i] = offset;
>>  }
>>  
>> +for (i = 0; i < p->zero_num; i++) {
>> +uint64_t offset = be64_to_cpu(packet->offset[p->normal_num + i]);
>> +
>> +if (offset > (block->used_length - p->page_size)) {
>> +error_setg(errp, "multifd: offset too long %" PRIu64
>> +   " (max " RAM_ADDR_FMT ")",
>> +   offset, block->used_length);
>> +return -1;
>> +}
>> +p->zero[i] = offset;
>> +}
>> +
>>  return 0;
>>  }
>
> IIUC ram_addr_t is supposed to be the address size for the architecture, 
> mainly
> being 32 or 64 bits. So packet->offset[i] is always u64, and p->zero[i] 
> possibly
> being u32 or u64.
>
> Since both local variables and packet->offset[i] are 64-bit, there is no 
> issue.
>
> But on 'p->zero[i] = offset' we can have 'u32 = u64', and this should raise a
> warning (or am I missing something?).

I don't really know what to do here.
The problem is only theoretical (in the long, long past, we have
supported migrating between different architectures, but we aren't
testing that anymore).

And because it was a pain in the ass, we define it as:

/* address in the RAM (different from a physical address) */
#if defined(CONFIG_XEN_BACKEND)
typedef uint64_t ram_addr_t;
#  define RAM_ADDR_MAX UINT64_MAX
#  define RAM_ADDR_FMT "%" PRIx64
#else
typedef uintptr_t ram_addr_t;
#  define RAM_ADDR_MAX UINTPTR_MAX
#  define RAM_ADDR_FMT "%" PRIxPTR
#endif

So I am pretty sure that almost nothing uses 32bits for it now (I
haven't checked lately, but I guess that nobody is really using/testing
xen on 32 bits).

I don't really know.  But it could only happen when you are migrating
from Xen 64 bits to Xen 32 bits, and I don't really know if that even works.

I will give it a try to change normal/zero to u64.
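
I.e. something like this (assuming the fields are currently declared as
ram_addr_t in multifd.h):

  -ram_addr_t *normal;
  +uint64_t *normal;
  -ram_addr_t *zero;
  +uint64_t *zero;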

Thanks, Juan.




Re: [PATCH v6 07/10] vdpa: Add asid parameter to vhost_vdpa_dma_map/unmap

2022-11-14 Thread Jason Wang



在 2022/11/11 21:02, Eugenio Perez Martin 写道:

On Fri, Nov 11, 2022 at 8:41 AM Jason Wang  wrote:


在 2022/11/10 21:22, Eugenio Perez Martin 写道:

On Thu, Nov 10, 2022 at 6:51 AM Jason Wang  wrote:

On Wed, Nov 9, 2022 at 1:08 AM Eugenio Pérez  wrote:

So the caller can choose which ASID the mapping is destined for.

No need to update the batch functions as they will always be called from
memory listener updates at the moment. Memory listener updates will
always update ASID 0, as it's the passthrough ASID.

All vhost devices' ASIDs are 0 at this moment.

Signed-off-by: Eugenio Pérez 
---
v5:
* Solve conflict, now vhost_vdpa_svq_unmap_ring returns void
* Change comment on zero initialization.

v4: Add comment specifying behavior if device does not support _F_ASID

v3: Deleted unneeded space
---
   include/hw/virtio/vhost-vdpa.h |  8 +---
   hw/virtio/vhost-vdpa.c | 29 +++--
   net/vhost-vdpa.c   |  6 +++---
   hw/virtio/trace-events |  4 ++--
   4 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index d85643..6560bb9d78 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -29,6 +29,7 @@ typedef struct vhost_vdpa {
   int index;
   uint32_t msg_type;
   bool iotlb_batch_begin_sent;
+uint32_t address_space_id;

So the trick is to let device-specific code zero this during allocation?


Yes, but I don't see how that is a trick :). All other parameters also
trust it to be 0 at allocation.


   MemoryListener listener;
   struct vhost_vdpa_iova_range iova_range;
   uint64_t acked_features;
@@ -42,8 +43,9 @@ typedef struct vhost_vdpa {
   VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
   } VhostVDPA;

-int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,
-   void *vaddr, bool readonly);
-int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, hwaddr iova, hwaddr size);
+int vhost_vdpa_dma_map(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
+   hwaddr size, void *vaddr, bool readonly);
+int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
+ hwaddr size);

   #endif
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 23efb8f49d..8fd32ba32b 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -72,22 +72,24 @@ static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section,
   return false;
   }

-int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,
-   void *vaddr, bool readonly)
+int vhost_vdpa_dma_map(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
+   hwaddr size, void *vaddr, bool readonly)
   {
   struct vhost_msg_v2 msg = {};
   int fd = v->device_fd;
   int ret = 0;

   msg.type = v->msg_type;
+msg.asid = asid; /* 0 if vdpa device does not support asid */

The comment here is confusing. If this is a requirement, we need either

1) doc this

or

2) perform necessary checks in the function itself.


I only documented it in vhost_vdpa_dma_unmap and now I realize it.
Would it work to just copy that comment here?


Probably, and let's move the comment above the function definition.
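For example, the shared precondition could be documented once above the
prototypes (wording is only a suggestion, not the literal patch):

/*
 * Map/unmap a region into address space 'asid'.  The caller must pass
 * asid = 0 if the device does not advertise VHOST_BACKEND_F_IOTLB_ASID;
 * this is not an ABI break since struct vhost_msg_v2 is zero-initialized
 * anyway.
 */
int vhost_vdpa_dma_map(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
                       hwaddr size, void *vaddr, bool readonly);
int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
                         hwaddr size);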



   msg.iotlb.iova = iova;
   msg.iotlb.size = size;
   msg.iotlb.uaddr = (uint64_t)(uintptr_t)vaddr;
   msg.iotlb.perm = readonly ? VHOST_ACCESS_RO : VHOST_ACCESS_RW;
   msg.iotlb.type = VHOST_IOTLB_UPDATE;

-   trace_vhost_vdpa_dma_map(v, fd, msg.type, msg.iotlb.iova, msg.iotlb.size,
-msg.iotlb.uaddr, msg.iotlb.perm, msg.iotlb.type);
+trace_vhost_vdpa_dma_map(v, fd, msg.type, msg.asid, msg.iotlb.iova,
+ msg.iotlb.size, msg.iotlb.uaddr, msg.iotlb.perm,
+ msg.iotlb.type);

    if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
   error_report("failed to write, fd=%d, errno=%d (%s)",
@@ -98,18 +100,24 @@ int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,
   return ret;
   }

-int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, hwaddr iova, hwaddr size)
+int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
+ hwaddr size)
   {
   struct vhost_msg_v2 msg = {};
   int fd = v->device_fd;
   int ret = 0;

   msg.type = v->msg_type;
+/*
+ * The caller must set asid = 0 if the device does not support asid.
+ * This is not an ABI break since it is set to 0 by the initializer anyway.
+ */
+msg.asid = asid;
   msg.iotlb.iova = iova;
   msg.iotlb.size = size;
   msg.iotlb.type = VHOST_IOTLB_INVALIDATE;

-trace_vhost_vdpa_dma_unmap(v, fd, msg.type, msg.iotlb.iova,
+trace_vhost_vdpa_dma_unmap(v, fd, msg.type, msg.asid, msg.iotlb.iova,
  msg.iotlb.size, msg.iotlb.type);

   if 

Re: [PATCH v6 10/10] vdpa: Always start CVQ in SVQ mode

2022-11-14 Thread Jason Wang



在 2022/11/11 22:38, Eugenio Perez Martin 写道:

On Fri, Nov 11, 2022 at 9:03 AM Jason Wang  wrote:


在 2022/11/11 00:07, Eugenio Perez Martin 写道:

On Thu, Nov 10, 2022 at 7:25 AM Jason Wang  wrote:

在 2022/11/9 01:07, Eugenio Pérez 写道:

Isolate the control virtqueue in its own group, allowing us to intercept
control commands while letting the dataplane run fully passthrough to the guest.

I think we need to tweak the title to "vdpa: Always start CVQ in SVQ
mode if possible". Since SVQ for CVQ can't be enabled without ASID support?


Yes, I totally agree.


Btw, I wonder if it's worth removing the "x-" prefix for the shadow
virtqueue. It can help the devices that lack ASID support but want to do
live migration.


Sure I can propose on top.


Signed-off-by: Eugenio Pérez 
---
v6:
* Disable control SVQ if the device does not support it because of
features.

v5:
* Fixing the not adding cvq buffers when x-svq=on is specified.
* Move vring state in vhost_vdpa_get_vring_group instead of using a
 parameter.
* Rename VHOST_VDPA_NET_CVQ_PASSTHROUGH to VHOST_VDPA_NET_DATA_ASID

v4:
* Squash vhost_vdpa_cvq_group_is_independent.
* Rebased on last CVQ start series, that allocated CVQ cmd bufs at load
* Do not check for cvq index on vhost_vdpa_net_prepare, we only have one
 that callback registered in that NetClientInfo.

v3:
* Make asid related queries print a warning instead of returning an
 error and stop the start of qemu.
---
hw/virtio/vhost-vdpa.c |   3 +-
net/vhost-vdpa.c   | 138 ++---
2 files changed, 132 insertions(+), 9 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index e3914fa40e..6401e7efb1 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -648,7 +648,8 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
{
uint64_t features;
uint64_t f = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
-0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH;
+0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
+0x1ULL << VHOST_BACKEND_F_IOTLB_ASID;
int r;

 if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &features)) {
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 02780ee37b..7245ea70c6 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -38,6 +38,9 @@ typedef struct VhostVDPAState {
void *cvq_cmd_out_buffer;
virtio_net_ctrl_ack *status;

+/* Number of address spaces supported by the device */
+unsigned address_space_num;

I'm not sure this is the best place to store things like this since it
can cause confusion. We will have multiple VhostVDPAState when
multiqueue is enabled.


I think we can delete this and ask it on each device start.


+
/* The device always have SVQ enabled */
bool always_svq;
bool started;
@@ -101,6 +104,9 @@ static const uint64_t vdpa_svq_device_features =
BIT_ULL(VIRTIO_NET_F_RSC_EXT) |
BIT_ULL(VIRTIO_NET_F_STANDBY);

+#define VHOST_VDPA_NET_DATA_ASID 0
+#define VHOST_VDPA_NET_CVQ_ASID 1
+
VHostNetState *vhost_vdpa_get_vhost_net(NetClientState *nc)
{
VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
@@ -242,6 +248,34 @@ static NetClientInfo net_vhost_vdpa_info = {
.check_peer_type = vhost_vdpa_check_peer_type,
};

+static uint32_t vhost_vdpa_get_vring_group(int device_fd, unsigned vq_index)
+{
+struct vhost_vring_state state = {
+.index = vq_index,
+};
+int r = ioctl(device_fd, VHOST_VDPA_GET_VRING_GROUP, &state);
+
+return r < 0 ? 0 : state.num;

Assuming 0 when ioctl() fails is probably not a good idea: errors in the
ioctl might be hidden. It would be better to fall back to 0 only when ASID
is not supported.


Did I misunderstand you on [1]?


Nope. I think I was wrong at that time then :( Sorry for that.

We should differ from the case

1) no ASID support so 0 is assumed

2) something went wrong in the ioctl; it's not necessarily ENOTSUPP.


What action should we take here? Isn't it better to disable SVQ and
let the device run?



It should fail the function instead of assuming 0, like how the other
vhost ioctls work.
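A sketch of what failing the function could look like (the int64_t return
type is an assumption, so that an error cannot be confused with a valid
group number):

static int64_t vhost_vdpa_get_vring_group(int device_fd, unsigned vq_index)
{
    struct vhost_vring_state state = {
        .index = vq_index,
    };
    int r = ioctl(device_fd, VHOST_VDPA_GET_VRING_GROUP, &state);

    if (unlikely(r < 0)) {
        /* Propagate the error instead of silently assuming group 0 */
        return -errno;
    }

    return state.num;
}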






+}
+
+static int vhost_vdpa_set_address_space_id(struct vhost_vdpa *v,
+   unsigned vq_group,
+   unsigned asid_num)
+{
+struct vhost_vring_state asid = {
+.index = vq_group,
+.num = asid_num,
+};
+int ret;
+
+ret = ioctl(v->device_fd, VHOST_VDPA_SET_GROUP_ASID, &asid);
+if (unlikely(ret < 0)) {
+warn_report("Can't set vq group %u asid %u, errno=%d (%s)",
+asid.index, asid.num, errno, g_strerror(errno));
+}
+return ret;
+}
+
static void vhost_vdpa_cvq_unmap_buf(struct vhost_vdpa *v, void *addr)
{
VhostIOVATree *tree = v->iova_tree;
@@ -316,11 +350,54 @@ dma_map_err:
static int vhost_vdpa_net_cvq_start(NetClientState *nc)
{
VhostVDPAState *s;
-int r;
+struct 

Re: [PATCH v7 10/12] multifd: Support for zero pages transmission

2022-11-14 Thread Juan Quintela
Leonardo Brás  wrote:

...

>> @@ -712,6 +726,7 @@ static void *multifd_send_thread(void *opaque)
>>  qemu_mutex_lock(>mutex);
>>  p->num_packets++;
>>  p->total_normal_pages += p->normal_num;
>> +p->total_zero_pages += p->zero_num;
>
> I can see it getting declared, incremented and used. But where is it 
> initialized
> to zero? I mean, should it not have 'p->total_normal_pages = 0;' somewhere in
> setup?

int multifd_save_setup(Error **errp)
{


thread_count = migrate_multifd_channels();
multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);

You can see here that we set up everything to zero.  We only need to
initialize explicitly whatever is not zero.


> (I understand multifd_save_setup() allocates a multifd_send_state->params with
> g_new0(),but other variables are zeroed there, like p->pending_job and 
> p->write_flags, so why not?)   

Humm, I think that it is better to do it the other way around: remove
the initializations to zero.  That way we only set explicitly whatever
is not zero.


Thanks, Juan.




[PATCH v2 15/15] RFC: nbd/server: Send 64-bit hole chunk

2022-11-14 Thread Eric Blake
Since we cap NBD_CMD_READ requests to 32M, we never have a reason to
send a 64-bit chunk type for a hole; but it is worth producing these
for interoperability testing of clients that want extended headers.
---
 nbd/server.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index cd280f1721..04cb172f97 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2112,9 +2112,13 @@ static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
 if (status & BDRV_BLOCK_ZERO) {
 NBDReply hdr;
 NBDStructuredReadHole chunk;
+NBDStructuredReadHoleExt chunk_ext;
 struct iovec iov[] = {
 {.iov_base = &hdr},
-{.iov_base = &chunk, .iov_len = sizeof(chunk)},
+{.iov_base = client->extended_headers ? &chunk_ext
+ : (void *) &chunk,
+ .iov_len = client->extended_headers ? sizeof(chunk_ext)
+ : sizeof(chunk)},
 };

 trace_nbd_co_send_structured_read_hole(request->handle,
@@ -2122,9 +2126,17 @@ static int coroutine_fn nbd_co_send_sparse_read(NBDClient *client,
pnum);
 set_be_chunk(client, &iov[0],
  final ? NBD_REPLY_FLAG_DONE : 0,
- NBD_REPLY_TYPE_OFFSET_HOLE, request, iov[1].iov_len);
-stq_be_p(&chunk.offset, offset + progress);
-stl_be_p(&chunk.length, pnum);
+ client->extended_headers
+ ? NBD_REPLY_TYPE_OFFSET_HOLE_EXT
+ : NBD_REPLY_TYPE_OFFSET_HOLE,
+ request, iov[1].iov_len);
+if (client->extended_headers) {
+stq_be_p(&chunk_ext.offset, offset + progress);
+stq_be_p(&chunk_ext.length, pnum);
+} else {
+stq_be_p(&chunk.offset, offset + progress);
+stl_be_p(&chunk.length, pnum);
+}
 ret = nbd_co_send_iov(client, iov, 2, errp);
 } else {
 ret = blk_pread(exp->common.blk, offset + progress, pnum,
-- 
2.38.1




Re: [PATCH v2] capstone: use <capstone/capstone.h> instead of <capstone.h>

2022-11-14 Thread Daniel P. Berrangé
On Sun, Nov 13, 2022 at 11:09:42PM +0300, Michael Tokarev wrote:
> The upcoming capstone 5.0 drops support for the old way
> of including its header, due to this change:
> https://github.com/capstone-engine/capstone/commit/6656bcb63ab4e87dc6079bd6b6b12cc8dd9b2ad8
> The official way is to use <capstone/capstone.h>
> 
> This change has already been proposed before, see
> https://patchwork.kernel.org/project/qemu-devel/patch/20180215173539.11033-1-f4...@amsat.org/
> but it didn't find its way into qemu at that time.
> 
> On current systems, using <capstone/capstone.h> works
> now (despite the pkg-config-supplied -I/usr/include/capstone) -
> since on all systems capstone headers are put into capstone/
> subdirectory of a system include dir. So this change is
> compatible with both the obsolete way of including it
> and the only future way.

AFAIR, macOS HomeBrew does not put anything into the system
include dir, and always requires -I flags to be correct.

> diff --git a/meson.build b/meson.build
> index cf3e517e56..6f34c963f7 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -2680,12 +2680,7 @@ if not get_option('capstone').auto() or have_system or have_user
>capstone = dependency('capstone', version: '>=3.0.5',
>  kwargs: static_kwargs, method: 'pkg-config',
>  required: get_option('capstone'))
> -
> -  # Some versions of capstone have broken pkg-config file
> -  # that reports a wrong -I path, causing the #include to
> -  # fail later. If the system has such a broken version
> -  # do not use it.
> -  if capstone.found() and not cc.compiles('#include <capstone.h>',
> +  if capstone.found() and not cc.compiles('#include <capstone/capstone.h>',
>                                            dependencies: [capstone])

To retain back compat this could probe for both ways

if capstone.found()
  if cc.compiles('#include <capstone/capstone.h>',
                 dependencies: [capstone])
    ...
  else if cc.compiles('#include <capstone.h>',
                      dependencies: [capstone])
    ...

then, the source file can try the correct #include based on what
we detect works here.
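For instance (the config macro name here is illustrative, not an existing
QEMU define):

/* Select whichever include style the meson probe found to work. */
#ifdef CONFIG_CAPSTONE_CAPSTONE_H
#include <capstone/capstone.h>
#else
#include <capstone.h>
#endif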


With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PULL 00/11] Block layer patches

2022-11-14 Thread John Snow
On Mon, Nov 14, 2022 at 5:56 AM Kevin Wolf  wrote:
>
> Am 11.11.2022 um 20:20 hat Stefan Hajnoczi geschrieben:
> > > Hanna Reitz (9):
> > >   block/mirror: Do not wait for active writes
> > >   block/mirror: Drop mirror_wait_for_any_operation()
> > >   block/mirror: Fix NULL s->job in active writes
> > >   iotests/151: Test that active mirror progresses
> > >   iotests/151: Test active requests on mirror start
> > >   block: Make bdrv_child_get_parent_aio_context I/O
> > >   block-backend: Update ctx immediately after root
> > >   block: Start/end drain on correct AioContext
> > >   tests/stream-under-throttle: New test
> >
> > Hi Hanna,
> > This test is broken, probably due to the minimum Python version:
> > https://gitlab.com/qemu-project/qemu/-/jobs/3311521303
>
> This is exactly the problem I saw with running linters in a gating CI,
> but not during 'make check'. And of course, we're hitting it during the
> -rc phase now. :-(

I mean. I'd love to have it run in make check too. The alternative was
never seeing this *anywhere* ...

...but I'm sorry it's taken me so long to figure out how to get this
stuff to work in "make check" and also from manual iotests runs
without adding any kind of setup that you have to manage. It's just
fiddly, sorry :(

>
> But yes, it seems that asyncio.TimeoutError should be used instead of
> asyncio.exceptions.TimeoutError, and Python 3.6 has only the former.
> I'll fix this up and send a v2 if it fixes check-python-pipenv.

Hopefully this goes away when we drop 3.6. I want to, but I recall
there was some question about some platforms that don't support 3.7+
"by default" and how annoying that was or wasn't. We're almost a year
out from 3.6 being EOL, so maybe after this release it's worth a crack
to see how painful it is to move on.

>
> Kevin
>




[PATCH v2 02/15] nbd/server: Prepare for alternate-size headers

2022-11-14 Thread Eric Blake
An upcoming NBD extension wants to add the ability to do 64-bit effect
lengths in requests.  As part of that extension, the size of the reply
headers will change in order to permit a 64-bit length in the reply
for symmetry [*].  Additionally, where the reply header is currently
16 bytes for simple reply, and 20 bytes for structured reply; with the
extension enabled, there will only be one structured reply type, of 32
bytes.  Since we are already wired up to use iovecs, it is easiest to
allow for this change in header size by splitting each structured
reply across two iovecs, one for the header (which will become
variable-length in a future patch according to client negotiation),
and the other for the payload, and removing the header from the
payload struct definitions.  Interestingly, the client side code never
utilized the packed types, so only the server code needs to be
updated.

[*] Note that on the surface, this is because some future server might
permit a 4G+ NBD_CMD_READ and need to reply with that much data in one
transaction.  But even though the extended reply length is widened to
64 bits, at present we will still never send a reply payload larger
than just over 32M (the maximum buffer we allow in NBD_CMD_READ; and
we cap the number of extents we are willing to report in
NBD_CMD_BLOCK_STATUS); for a future server to truly support a 4G read
in one transaction, NBD_OPT_GO would need an extension of a new
NBD_INFO_ field that provides for a 64-bit maximum transaction length.
Where 64-bit fields really matter in the extension is in a later patch
adding 64-bit support into a counterpart for REPLY_TYPE_BLOCK_STATUS.

Signed-off-by: Eric Blake 
---
 include/block/nbd.h |  8 +++---
 nbd/server.c| 64 -
 2 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 4ede3b2bd0..1330dbc18b 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -95,28 +95,28 @@ typedef union NBDReply {

 /* Header of chunk for NBD_REPLY_TYPE_OFFSET_DATA */
 typedef struct NBDStructuredReadData {
-NBDStructuredReplyChunk h; /* h.length >= 9 */
+/* header's .length >= 9 */
 uint64_t offset;
 /* At least one byte of data payload follows, calculated from h.length */
 } QEMU_PACKED NBDStructuredReadData;

 /* Complete chunk for NBD_REPLY_TYPE_OFFSET_HOLE */
 typedef struct NBDStructuredReadHole {
-NBDStructuredReplyChunk h; /* h.length == 12 */
+/* header's length == 12 */
 uint64_t offset;
 uint32_t length;
 } QEMU_PACKED NBDStructuredReadHole;

 /* Header of all NBD_REPLY_TYPE_ERROR* errors */
 typedef struct NBDStructuredError {
-NBDStructuredReplyChunk h; /* h.length >= 6 */
+/* header's length >= 6 */
 uint32_t error;
 uint16_t message_length;
 } QEMU_PACKED NBDStructuredError;

 /* Header of NBD_REPLY_TYPE_BLOCK_STATUS */
 typedef struct NBDStructuredMeta {
-NBDStructuredReplyChunk h; /* h.length >= 12 (at least one extent) */
+/* header's length >= 12 (at least one extent) */
 uint32_t context_id;
 /* extents follows */
 } QEMU_PACKED NBDStructuredMeta;
diff --git a/nbd/server.c b/nbd/server.c
index ada16089f3..37f9c21d20 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1888,9 +1888,12 @@ static int coroutine_fn nbd_co_send_iov(NBDClient *client, struct iovec *iov,
 return ret;
 }

-static inline void set_be_simple_reply(NBDSimpleReply *reply, uint64_t error,
-   uint64_t handle)
+static inline void set_be_simple_reply(NBDClient *client, struct iovec *iov,
+   uint64_t error, uint64_t handle)
 {
+NBDSimpleReply *reply = iov->iov_base;
+
+iov->iov_len = sizeof(*reply);
 stl_be_p(&reply->magic, NBD_SIMPLE_REPLY_MAGIC);
 stl_be_p(&reply->error, error);
 stq_be_p(&reply->handle, handle);
@@ -1903,23 +1906,27 @@ static int nbd_co_send_simple_reply(NBDClient *client,
 size_t len,
 Error **errp)
 {
-NBDSimpleReply reply;
+NBDReply hdr;
 int nbd_err = system_errno_to_nbd_errno(error);
 struct iovec iov[] = {
-{.iov_base = &reply, .iov_len = sizeof(reply)},
+{.iov_base = &hdr},
 {.iov_base = data, .iov_len = len}
 };

 trace_nbd_co_send_simple_reply(handle, nbd_err, nbd_err_lookup(nbd_err),
len);
-set_be_simple_reply(&reply, nbd_err, handle);
+set_be_simple_reply(client, &iov[0], nbd_err, handle);

 return nbd_co_send_iov(client, iov, len ? 2 : 1, errp);
 }

-static inline void set_be_chunk(NBDStructuredReplyChunk *chunk, uint16_t flags,
-uint16_t type, uint64_t handle, uint32_t length)
+static inline void set_be_chunk(NBDClient *client, struct iovec *iov,
+uint16_t flags, uint16_t type,
+uint64_t handle, uint32_t length)
 {
+

[libnbd PATCH v2 23/23] RFC: pread: Accept 64-bit holes

2022-11-14 Thread Eric Blake
Even though we don't currently allow the user to request NBD_CMD_READ
with more than 64M (and even if we did, our API signature caps us at
SIZE_MAX, which is 32 bits on a 32-bit machine), upstream NBD commit
XXX[*] states that for symmetry with 64-bit requests, extended header
clients must be prepared for a server response with a 64-bit hole,
even if the client never makes a read request that large.  Note that
we don't have to change the signature of the callback for
nbd_pread_structured; nor is it worth adding a 64-bit counterpart to
LIBNBD_READ_HOLE, because it is unlikely that a user callback will
ever need to distinguish between which size was sent over the wire,
when the value is always less than 32 bits.  Also note that the recent
NBD spec changes to add 64 bits did state that servers may allow
clients to request a read larger than the max block size, but if
such a read is not rejected with EOVERFLOW or EINVAL, then the reply
will be divided into chunks so that no chunk sends more than a max
block size payload.

While we cannot guarantee which size structured reply the server will
use, it is easy enough to handle both sizes, but tag the read with
EPROTO for a non-compliant server that sends wide replies when
extended headers were not negotiated.  A more likely reason that a
server may choose to send 64-bit hole chunks even for a 32-bit hole is
because the extended hole payload has nicer power-of-2 sizing.

---

[*] FIXME: Update this with the actual commit id, if upstream NBD even
goes with this option.
---
 lib/internal.h  |  1 +
 lib/nbd-protocol.h  |  6 +
 generator/states-reply-structured.c | 36 -
 3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/lib/internal.h b/lib/internal.h
index ac8d99c4..64d1941c 100644
--- a/lib/internal.h
+++ b/lib/internal.h
@@ -250,6 +250,7 @@ struct nbd_handle {
   union {
 struct nbd_structured_reply_offset_data offset_data;
 struct nbd_structured_reply_offset_hole offset_hole;
+struct nbd_structured_reply_offset_hole_ext offset_hole_ext;
 struct nbd_structured_reply_block_status_hdr bs_hdr;
 struct nbd_structured_reply_block_status_ext_hdr bs_ext_hdr;
 struct {
diff --git a/lib/nbd-protocol.h b/lib/nbd-protocol.h
index 2d1fabd0..2d1a3202 100644
--- a/lib/nbd-protocol.h
+++ b/lib/nbd-protocol.h
@@ -247,6 +247,11 @@ struct nbd_structured_reply_offset_hole {
   uint32_t length;  /* Length of hole. */
 } NBD_ATTRIBUTE_PACKED;

+struct nbd_structured_reply_offset_hole_ext {
+  uint64_t offset;
+  uint64_t length;  /* Length of hole. */
+} NBD_ATTRIBUTE_PACKED;
+
 /* NBD_REPLY_TYPE_BLOCK_STATUS block descriptor. */
 struct nbd_block_descriptor {
   uint32_t length;  /* length of block */
@@ -292,6 +297,7 @@ struct nbd_structured_reply_error {
 #define NBD_REPLY_TYPE_NONE 0
 #define NBD_REPLY_TYPE_OFFSET_DATA  1
 #define NBD_REPLY_TYPE_OFFSET_HOLE  2
+#define NBD_REPLY_TYPE_OFFSET_HOLE_EXT  3
 #define NBD_REPLY_TYPE_BLOCK_STATUS 5
 #define NBD_REPLY_TYPE_BLOCK_STATUS_EXT 6
 #define NBD_REPLY_TYPE_ERRORNBD_REPLY_TYPE_ERR (1)
diff --git a/generator/states-reply-structured.c 
b/generator/states-reply-structured.c
index 7e313b5a..e338bf74 100644
--- a/generator/states-reply-structured.c
+++ b/generator/states-reply-structured.c
@@ -28,15 +28,16 @@
  * requesting command.
  */
 static bool
-structured_reply_in_bounds (uint64_t offset, uint32_t length,
+structured_reply_in_bounds (uint64_t offset, uint64_t length,
 const struct command *cmd)
 {
   if (offset < cmd->offset ||
   offset >= cmd->offset + cmd->count ||
-  offset + length > cmd->offset + cmd->count) {
+  length > cmd->offset + cmd->count ||
+  offset > cmd->offset + cmd->count - length) {
 set_error (0, "range of structured reply is out of bounds, "
"offset=%" PRIu64 ", cmd->offset=%" PRIu64 ", "
-   "length=%" PRIu32 ", cmd->count=%" PRIu64 ": "
+   "length=%" PRIu64 ", cmd->count=%" PRIu64 ": "
"this is likely to be a bug in the NBD server",
offset, cmd->offset, length, cmd->count);
 return false;
@@ -141,6 +142,21 @@  REPLY.STRUCTURED_REPLY.CHECK:
 SET_NEXT_STATE (%RECV_OFFSET_HOLE);
 break;

+  case NBD_REPLY_TYPE_OFFSET_HOLE_EXT:
+if (cmd->type != NBD_CMD_READ ||
+length != sizeof h->sbuf.reply.payload.offset_hole_ext)
+  goto resync;
+if (!h->extended_headers) {
+  debug (h, "unexpected 64-bit hole without extended headers, "
+ "this is probably a server bug");
+  if (cmd->error == 0)
+cmd->error = EPROTO;
+}
+h->rbuf = &h->sbuf.reply.payload.offset_hole_ext;
+h->rlen = sizeof h->sbuf.reply.payload.offset_hole_ext;
+SET_NEXT_STATE (%RECV_OFFSET_HOLE);
+break;
+
   case NBD_REPLY_TYPE_BLOCK_STATUS:
 if 

Re: [PATCH v2 02/15] migration: Add postcopy_preempt_active()

2022-11-14 Thread Juan Quintela
Peter Xu  wrote:
> Add the helper to show that postcopy preempt enabled, meanwhile active.
>
> Reviewed-by: Dr. David Alan Gilbert 
> Signed-off-by: Peter Xu 

Reviewed-by: Juan Quintela 




[cross-project PATCH v2] NBD 64-bit extensions

2022-11-14 Thread Eric Blake
This is a cover letter for a set of multi-project patch series all
designed to implement 64-bit operations in NBD, and demonstrate
interoperability of the new extension between projects.

v1 of the project was attempted nearly a year ago:
https://lists.nongnu.org/archive/html/qemu-devel/2021-12/msg00453.html

Since then, I've addressed a lot of preliminary cleanups in libnbd to
make it easier to test things, and incorporated review comments issued
on v1, including:

- no orthogonality between simple/structured replies and 64-bit mode:
  once extended headers are negotiated, all transmission traffic (both
  from client to server and server to client) uses just one header
  size

- add support for the client to pass a payload on commands to the
  server, and demonstrate it by further implementing a way to pass a
  flag with NBD_CMD_BLOCK_STATUS that says the client is passing a
  payload to request status of just a subset of the negotiated
  contexts, rather than all possible contexts that were earlier
  negotiated during NBD_OPT_SET_META_CONTEXT

- tweaks to the header layouts: tweak block status to provide 64-bit
  flags values (although "base:allocation" will remain usable with
  just 32-bit flags); reply with offset rather than padding and with
  fields rearranged for maximal sharing between client and server
  layouts

- word-smithing of the NBD proposed protocol; split things into
  several smaller patches, where we can choose how much to take

- more unit tests added in qemu and libnbd

- the series ends with RFC patches on whether to support 64-bit hole
  responses to NBD_CMD_READ (even though our current limitations say
  servers don't have to accept more than a 32M read request and
  therefore will never have that big of a hole to read)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-14 Thread Christian Borntraeger




Am 14.11.22 um 18:10 schrieb Michael S. Tsirkin:

On Mon, Nov 14, 2022 at 05:55:09PM +0100, Christian Borntraeger wrote:



Am 14.11.22 um 17:37 schrieb Michael S. Tsirkin:

On Mon, Nov 14, 2022 at 05:18:53PM +0100, Christian Borntraeger wrote:

Am 08.11.22 um 10:23 schrieb Alex Bennée:

The previous fix to virtio_device_started revealed a problem in its
use by both the core and the device code. The core code should be able
to handle the device "starting" while the VM isn't running to handle
the restoration of migration state. To solve this dual use introduce a
new helper for use by the vhost-user backends who all use it to feed a
should_start variable.

We can also pick up a change vhost_user_blk_set_status while we are at
it which follows the same pattern.

Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to virtio_device_started)
Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
Signed-off-by: Alex Bennée 
Cc: "Michael S. Tsirkin" 


Hmmm, is this
commit 259d69c00b67c02a67f3bdbeeea71c2c0af76c35
Author: Alex Bennée 
AuthorDate: Mon Nov 7 12:14:07 2022 +
Commit: Michael S. Tsirkin 
CommitDate: Mon Nov 7 14:08:18 2022 -0500

  hw/virtio: introduce virtio_device_should_start

an older version?


This is what got merged:
https://lore.kernel.org/r/20221107121407.1010913-1-alex.bennee%40linaro.org
This patch was sent after I merged the RFC.
I think the only difference is the commit log but I might be missing
something.


This does not seem to fix the regression that I have reported.


This was applied on top of 9f6bcfd99f which IIUC does, right?




QEMU master still fails for me for suspend/resume to disk:

#0  0x03ff8e3980a6 in __pthread_kill_implementation () at /lib64/libc.so.6
#1  0x03ff8e348580 in raise () at /lib64/libc.so.6
#2  0x03ff8e32b5c0 in abort () at /lib64/libc.so.6
#3  0x03ff8e3409da in __assert_fail_base () at /lib64/libc.so.6
#4  0x03ff8e340a4e in  () at /lib64/libc.so.6
#5  0x02aa1ffa8966 in vhost_vsock_common_pre_save (opaque=<optimized out>) 
at ../hw/virtio/vhost-vsock-common.c:203
#6  0x02aa1fe5e0ee in vmstate_save_state_v
 (f=f@entry=0x2aa21bdc170, vmsd=0x2aa204ac5f0 , 
opaque=0x2aa21bac9f8, vmdesc=vmdesc@entry=0x3fddc08eb30, 
version_id=version_id@entry=0) at ../migration/vmstate.c:329
#7  0x02aa1fe5ebf8 in vmstate_save_state (f=f@entry=0x2aa21bdc170, vmsd=<optimized out>, opaque=<optimized out>, vmdesc_id=vmdesc_id@entry=0x3fddc08eb30) at 
../migration/vmstate.c:317
#8  0x02aa1fe75bd0 in vmstate_save (f=f@entry=0x2aa21bdc170, 
se=se@entry=0x2aa21bdbe90, vmdesc=vmdesc@entry=0x3fddc08eb30) at 
../migration/savevm.c:908
#9  0x02aa1fe79584 in qemu_savevm_state_complete_precopy_non_iterable 
(f=f@entry=0x2aa21bdc170, in_postcopy=in_postcopy@entry=false, 
inactivate_disks=inactivate_disks@entry=true)
 at ../migration/savevm.c:1393
#10 0x02aa1fe79a96 in qemu_savevm_state_complete_precopy (f=0x2aa21bdc170, 
iterable_only=iterable_only@entry=false, 
inactivate_disks=inactivate_disks@entry=true) at ../migration/savevm.c:1459
#11 0x02aa1fe6d6ee in migration_completion (s=0x2aa218ef600) at 
../migration/migration.c:3314
#12 migration_iteration_run (s=0x2aa218ef600) at ../migration/migration.c:3761
#13 migration_thread (opaque=opaque@entry=0x2aa218ef600) at 
../migration/migration.c:3989
#14 0x02aa201f0b8c in qemu_thread_start (args=<optimized out>) at 
../util/qemu-thread-posix.c:505
#15 0x03ff8e396248 in start_thread () at /lib64/libc.so.6
#16 0x03ff8e41183e in thread_start () at /lib64/libc.so.6

Michael, your previous branch did work if I recall correctly.


That one was failing under github CI though (for reasons we didn't
really address, such as disconnect during stop causing a recursive
call to stop, but there you are).

Even the double revert of everything?
So how do we proceed now?



Re: [PATCH v3 1/7] memory: associate DMA accesses with the initiator Device

2022-11-14 Thread Stefan Hajnoczi
On Fri, 28 Oct 2022 at 15:19, Alexander Bulekov  wrote:
>
> Add transitionary DMA APIs which associate accesses with the device
> initiating them. The modified APIs maintain a "MemReentrancyGuard" in
> the DeviceState, which is used to prevent DMA re-entrancy issues.
> The MemReentrancyGuard is set/checked when entering IO handlers and when
> initiating a DMA access.
>
> 1.) mmio -> dma -> mmio case
> 2.) bh -> dma write -> mmio case
>
> These issues have led to problems such as stack-exhaustion and
> use-after-frees.
>
> Summary of the problem from Peter Maydell:
> https://lore.kernel.org/qemu-devel/cafeaca_23vc7he3iam-jva6w38lk4hjowae5kcknhprd5fp...@mail.gmail.com
>
> Signed-off-by: Alexander Bulekov 
> ---
>  include/hw/qdev-core.h |  2 ++
>  include/sysemu/dma.h   | 41 +
>  softmmu/memory.c   | 15 +++
>  softmmu/trace-events   |  1 +
>  4 files changed, 59 insertions(+)
>
> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> index 785dd5a56e..ab78d211af 100644
> --- a/include/hw/qdev-core.h
> +++ b/include/hw/qdev-core.h
> @@ -8,6 +8,7 @@
>  #include "qom/object.h"
>  #include "hw/hotplug.h"
>  #include "hw/resettable.h"
> +#include "sysemu/dma.h"
>
>  enum {
>  DEV_NVECTORS_UNSPECIFIED = -1,
> @@ -194,6 +195,7 @@ struct DeviceState {
>  int alias_required_for_version;
>  ResettableState reset;
>  GSList *unplug_blockers;
> +MemReentrancyGuard mem_reentrancy_guard;
>  };
>
>  struct DeviceListener {
> diff --git a/include/sysemu/dma.h b/include/sysemu/dma.h
> index a1ac5bc1b5..879b666bbb 100644
> --- a/include/sysemu/dma.h
> +++ b/include/sysemu/dma.h
> @@ -15,6 +15,10 @@
>  #include "block/block.h"
>  #include "block/accounting.h"
>
> +typedef struct {
> +bool engaged_in_io;
> +} MemReentrancyGuard;

Please add a doc comment that explains the purpose of MemReentrancyGuard.

> +
>  typedef enum {
>  DMA_DIRECTION_TO_DEVICE = 0,
>  DMA_DIRECTION_FROM_DEVICE = 1,
> @@ -321,4 +325,41 @@ void dma_acct_start(BlockBackend *blk, BlockAcctCookie *cookie,
>  uint64_t dma_aligned_pow2_mask(uint64_t start, uint64_t end,
> int max_addr_bits);
>
> +#define REENTRANCY_GUARD(func, ret_type, dev, ...) \
> +({\
> + ret_type retval;\
> + MemReentrancyGuard prior_guard_state = dev->mem_reentrancy_guard;\
> + dev->mem_reentrancy_guard.engaged_in_io = 1;\

Please use true/false for bool constants. That way it's obvious to the
reader that this is a bool and not an int.

> + retval = func(__VA_ARGS__);\
> + dev->mem_reentrancy_guard = prior_guard_state;\
> + retval;\
> + })

I'm trying to understand the purpose of this macro. It restores the
previous state of mem_reentrancy_guard, implying that this is
sometimes called when the guard is already true (i.e. from
MemoryRegion callbacks). It can also be called in the BH case and I
think that's why mem_reentrancy_guard is set to true here. Using BHs
to avoid deep stacks and re-entrancy is a valid technique though, and
this macro seems to be designed to prevent it. Can you explain a bit
more about how this is supposed to be used?

If this macro is a public API that other parts of QEMU will use, then
the following approach is more consistent with how the lock guard
macros work:

  REENTRANCY_GUARD(dev) {
  retval = func(1, 2, 3);
  }

It's also more readable than:

  REENTRANCY_GUARD(func, int, dev, 1, 2, 3);

?
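For reference, a sketch of how a block-scoped variant could be built (not
part of this patch; names and placement are assumptions). The cleanup
attribute restores the previous guard state on any scope exit, including
early return, similar in spirit to QEMU's lock guards:

typedef struct {
    DeviceState *dev;
    MemReentrancyGuard prior;
} ReentrancyGuardScope;

static inline void reentrancy_guard_leave(ReentrancyGuardScope *scope)
{
    /* Restore whatever guard state was in effect when the scope began */
    scope->dev->mem_reentrancy_guard = scope->prior;
}

#define REENTRANCY_GUARD_SCOPE(dev_)                                  \
    __attribute__((cleanup(reentrancy_guard_leave)))                  \
    ReentrancyGuardScope _reentrancy_scope = {                        \
        .dev = (dev_),                                                \
        .prior = (dev_)->mem_reentrancy_guard,                        \
    };                                                                \
    (dev_)->mem_reentrancy_guard.engaged_in_io = true

so that a caller could write:

    {
        REENTRANCY_GUARD_SCOPE(dev);
        retval = func(1, 2, 3);
    }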

> +#define REENTRANCY_GUARD_NORET(func, dev, ...) \
> +({\
> + MemReentrancyGuard prior_guard_state = dev->mem_reentrancy_guard;\
> + dev->mem_reentrancy_guard.engaged_in_io = 1;\
> + func(__VA_ARGS__);\
> + dev->mem_reentrancy_guard = prior_guard_state;\
> + })
> +#define dma_memory_rw_guarded(dev, ...) \
> +REENTRANCY_GUARD(dma_memory_rw, MemTxResult, dev, __VA_ARGS__)
> +#define dma_memory_read_guarded(dev, ...) \
> +REENTRANCY_GUARD(dma_memory_read, MemTxResult, dev, __VA_ARGS__)
> +#define dma_memory_write_guarded(dev, ...) \
> +REENTRANCY_GUARD(dma_memory_write, MemTxResult, dev, __VA_ARGS__)
> +#define dma_memory_set_guarded(dev, ...) \
> +REENTRANCY_GUARD(dma_memory_set, MemTxResult, dev, __VA_ARGS__)
> +#define dma_memory_map_guarded(dev, ...) \
> +REENTRANCY_GUARD(dma_memory_map, void*, dev, __VA_ARGS__)
> +#define dma_memory_unmap_guarded(dev, ...) \
> +REENTRANCY_GUARD_NORET(dma_memory_unmap, dev, __VA_ARGS__)
> +#define ldub_dma_guarded(dev, ...) \
> +REENTRANCY_GUARD(ldub_dma, MemTxResult, dev, __VA_ARGS__)
> +#define stb_dma_guarded(dev, ...) \
> +REENTRANCY_GUARD(stb_dma, MemTxResult, dev, __VA_ARGS__)
> +#define dma_buf_read_guarded(dev, ...) \
> +REENTRANCY_GUARD(dma_buf_read, MemTxResult, dev, __VA_ARGS__)
> +#define dma_buf_write_guarded(dev, ...) \
> +REENTRANCY_GUARD(dma_buf_write, MemTxResult, dev, __VA_ARGS__)
> +
>  #endif
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index 7ba2048836..c44dc75149 100644
> --- 

[PATCH v2 6/6] RFC: spec: Introduce NBD_REPLY_TYPE_OFFSET_HOLE_EXT

2022-11-14 Thread Eric Blake
Rather than requiring all servers and clients to have a 32-bit limit
on maximum NBD_CMD_READ/WRITE sizes, we can choose to standardize
support for a 64-bit single I/O transaction now.
NBD_REPLY_TYPE_OFFSET_DATA can already handle a large reply, but
NBD_REPLY_TYPE_OFFSET_HOLE needs a 64-bit counterpart.

By standardizing this, all clients must be prepared to support both
hole reply types, even though most server implementations of
extended replies are likely to only send one hole type.

---

As this may mean a corner case that gets less testing, I have split it
into a separate optional patch.  I implemented it in my
proof-of-concept, but am happy to drop this patch for what actually
goes upstream.

In particular, if we foresee clients and servers that WANT to support
a payload larger than 4G, it may be worth introducing an NBD_INFO_*
that supplies 64-bit block sizing information, rather than our current
inherent 32-bit limit of NBD_INFO_BLOCK_SIZE, at the same time as we
introduce this reply type.
---
 doc/proto.md | 73 
 1 file changed, 40 insertions(+), 33 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index 645a736..9c04411 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -2008,19 +2008,25 @@ size.
   64 bits: offset (unsigned)  
   32 bits: hole size (unsigned, MUST be nonzero)  

-  At this time, although servers that support extended headers are
-  permitted to accept client requests for `NBD_CMD_READ` with an
-  effect length larger than any advertised maximum block payload size
-  by splitting the reply into multiple chunks, portable clients SHOULD
-  NOT request a read *length* larger than 32 bits (corresponding to
-  the maximum block payload constraint implied by
-  `NBD_INFO_BLOCK_SIZE`), and therefore a 32-bit constraint on the
-  *hole size* does not represent an arbitrary limitation.  Should a
-  future scenario arise where it can be demonstrated that a client and
-  server would benefit from an extension allowing a maximum block
-  payload size to be larger than 32 bits, that extension would also
-  introduce a counterpart reply type that can express a 64-bit *hole
-  size*.
+* `NBD_REPLY_TYPE_OFFSET_HOLE_EXT` (3)
+
+  This chunk type is in the content chunk category.  *length* MUST be
+  exactly 16.  The semantics of this chunk mirror those of
+  `NBD_REPLY_TYPE_OFFSET_HOLE`, other than the use of a larger *hole
+  size* field.  This chunk type MUST NOT be used unless extended
+  headers were negotiated with `NBD_OPT_EXTENDED_HEADERS`.
+
+  The payload is structured as:
+
+  64 bits: offset (unsigned)  
+  64 bits: hole size (unsigned, MUST be nonzero)  
+
+  Note that even though extended headers are in use, a server may
+  enforce a maximum block size that is smaller than 32 bits, in which
+  case no valid `NBD_CMD_READ` will have a *length* large enough to
+  require the use of this chunk type.  However, a client using
+  extended headers MUST be prepared for the server to use either the
+  compact or extended chunk type.

 * `NBD_REPLY_TYPE_BLOCK_STATUS` (5)

@@ -2218,26 +2224,27 @@ The following request types exist:
 the following additional constraints.

 The server MAY split the reply into any number of content chunks
-(`NBD_REPLY_TYPE_OFFSET_DATA` and `NBD_REPLY_TYPE_OFFSET_HOLE`);
-each chunk MUST describe at least one byte, although to minimize
-overhead, the server SHOULD use chunks with lengths and offsets as
-an integer multiple of 512 bytes, where possible (the first and
-last chunk of an unaligned read being the most obvious places for
-an exception).  The server MUST NOT send content chunks that
-overlap with any earlier content or error chunk, and MUST NOT send
-chunks that describe data outside the offset and length of the
-request, but MAY send the content chunks in any order (the client
-MUST reassemble content chunks into the correct order), and MAY
-send additional content chunks even after reporting an error
-chunk.  A server MAY support read requests larger than the maximum
-block payload size by splitting the response across multiple
-chunks (in particular, if extended headers are not in use, a
-request for more than 2^32 - 8 bytes containing data rather than
-holes MUST be split to avoid overflowing the 32-bit
-`NBD_REPLY_TYPE_OFFSET_DATA` length field); however, the server is
-also permitted to reject large read requests up front, so a client
-should be prepared to retry with smaller requests if a large
-request fails.
+(`NBD_REPLY_TYPE_OFFSET_DATA` and `NBD_REPLY_TYPE_OFFSET_HOLE` for
+structured replies, additionally `NBD_REPLY_TYPE_OFFSET_HOLE_EXT`
+for extended headers); each chunk MUST describe at least one byte,
+although to minimize overhead, the server SHOULD use chunks with
+lengths and offsets as an integer multiple of 512 bytes, where
+possible (the first and last chunk 

[RFC PATCH] gitlab: add new dynamic check-gcov target for coverage

2022-11-14 Thread Alex Bennée
The aim of this was to expand the coverage checking to only target
builds affected by the current branch. Unfortunately, first time out,
the build falls over at the asset-uploading stage by exceeding some
size limit.

So for now I'm posting this as a proof-of-concept until I can figure
out a better way forward. The highlighting of which changes are tested
in the GitLab UI is quite nice though.

Signed-off-by: Alex Bennée 
---
 .gitlab-ci.d/buildtest.yml   | 24 --
 .gitlab-ci.d/pick-targets.py | 87 
 2 files changed, 107 insertions(+), 4 deletions(-)
 create mode 100755 .gitlab-ci.d/pick-targets.py

diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
index d21b4a1fd4..aa2c52ab11 100644
--- a/.gitlab-ci.d/buildtest.yml
+++ b/.gitlab-ci.d/buildtest.yml
@@ -479,20 +479,36 @@ build-gprof-gcov:
   variables:
 IMAGE: ubuntu2004
 CONFIGURE_ARGS: --enable-gprof --enable-gcov
-TARGETS: aarch64-softmmu ppc64-softmmu s390x-softmmu x86_64-softmmu
+TARGETS: aarch64-softmmu ppc64-linux-user
   artifacts:
 expire_in: 1 days
 paths:
   - build
 
-check-gprof-gcov:
+# This is special as the target list is dynamic
+build-gcov:
+  extends: .native_build_job_template
+  needs:
+job: amd64-ubuntu2004-container
+  before_script:
+- TARGETS=$(.gitlab-ci.d/pick-targets.py)
+  variables:
+IMAGE: ubuntu2004
+CONFIGURE_ARGS: --enable-gcov
+  artifacts:
+expire_in: 1 days
+paths:
+  - build
+  
+# This is special
+check-gcov:
   extends: .native_test_job_template
   needs:
-- job: build-gprof-gcov
+- job: build-gcov
   artifacts: true
   variables:
 IMAGE: ubuntu2004
-MAKE_CHECK_ARGS: check
+MAKE_CHECK_ARGS: check check-avocado
   after_script:
 - cd build
 - gcovr --xml-pretty --exclude-unreachable-branches --print-summary
diff --git a/.gitlab-ci.d/pick-targets.py b/.gitlab-ci.d/pick-targets.py
new file mode 100755
index 0000000000..db1eff0119
--- /dev/null
+++ b/.gitlab-ci.d/pick-targets.py
@@ -0,0 +1,87 @@
+#!/usr/bin/env python3
+#
+# pick-targets: pick a set of targets that are worth testing.
+#
+# Running code coverage is too expensive to run over all the builds.
+# Try and figure out a subset of targets that would be worth running
+# for the files touched between origin/master and the current HEAD.
+#
+# Copyright (C) 2022 Linaro Ltd
+#
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+import os
+import os.path
+import sys
+import subprocess
+
+# Dumb mapping from a target directory name to a list of
+# targets. Should we bother for those we know we don't have compilers for?
+multi_targets = {
+'arm': ['arm', 'aarch64'],
+'i386'   : ['i386', 'x86_64'],
+'microblaze' : ['microblaze' ], # no softmmu, 'microblazel' ],
+'mips'   : ['mips', 'mips64', 'mips64el', 'mipsel' ],
+'ppc': ['ppc', 'ppc64' ],
+'riscv'  : ['riscv32', 'riscv64'],
+'sh4': ['sh4', 'sh4eb'],
+'sparc'  : ['sparc', 'sparc64'],
+'xtensa' : ['xtensa', 'xtensaeb']
+}
+
+def map_dir_to_targets(name):
+    if name in multi_targets:
+        return multi_targets[name]
+    else:
+        return [name]
+
+namespace = "qemu-project"
+if len(sys.argv) >= 2:
+    namespace = sys.argv[1]
+
+cwd = os.getcwd()
+reponame = os.path.basename(cwd)
+repourl = f"https://gitlab.com/{namespace}/{reponame};
+
+# Add remote, fetch master and save the common ancestor
+subprocess.check_call(["git", "remote", "add", "pick-target", repourl])
+subprocess.check_call(["git", "fetch", "pick-target", "master"],
+  stdout=subprocess.DEVNULL,
+  stderr=subprocess.DEVNULL)
+
+ancestor = subprocess.check_output(["git", "merge-base",
+"pick-target/master", "HEAD"],
+   universal_newlines=True)
+
+ancestor = ancestor.strip()
+
+subprocess.check_call(["git", "remote", "rm", "pick-target"])
+
+# calculate the diff and extract list of touched files
+diff = subprocess.check_output(["git", "diff", "--numstat",
+ f"{ancestor}..HEAD"])
+
+files = [l.split("\t")[2] for l in diff.decode().split("\n") if "\t" in l]
+
+# Build options to track
+system = False
+user = False
+targets = []
+
+for f in files:
+    if f.startswith("hw"):
+        system = True
+    if f.startswith("linux-user"):
+        user = True
+    if f.startswith("target"):
+        t = f.split("/")[1]
+        targets.extend(map_dir_to_targets(t))
+
+target_list = []
+for t in sorted(set(targets)):
+    if system:
+        target_list.append(f"{t}-softmmu")
+    if user:
+        target_list.append(f"{t}-linux-user")
+
+print(" ".join(target_list))
-- 
2.34.1




[PATCH v2 0/3] Add OCP extended log to nvme QEMU

2022-11-14 Thread Joel Granados
The motivation and description are contained in the last patch in this set.
Will copy paste it here for convenience:

In order to evaluate the write amplification factor (WAF) within the storage
stack it is important to know the number of bytes written to the
controller. The existing SMART log value of Data Units Written is too
coarse (given in units of 500 KiB), so we add the SMART health
information extended log from the OCP specification (given in units of bytes).

To accommodate different vendor-specific specifications like OCP, we add a
multiplexing function (nvme_vendor_specific_log) which routes to the
different log functions based on arguments and log ids. We only return the
OCP extended smart log when the log page id is 0xC0 and ocp has been
turned on in the args.

Though we add the whole nvme smart log extended structure, we only populate
the physical_media_units_{read,written}, log_page_version and
log_page_uuid.
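A sketch of the multiplexer's shape (the signature is modeled on the
existing SMART log handler; the location of the ocp flag and the OCP
handler name are assumptions, not the literal patch):

static uint16_t nvme_vendor_specific_log(NvmeCtrl *n, uint8_t rae,
                                         uint32_t buf_len, uint64_t off,
                                         NvmeRequest *req, uint8_t lid)
{
    switch (lid) {
    case 0xc0:
        /* Serve the OCP extended SMART log only when ocp is enabled */
        if (n->subsys->params.ocp) {
            return nvme_ocp_extended_smart_info(n, rae, buf_len, off, req);
        }
        break;
    /* other vendor-specific log pages would be dispatched here */
    }

    return NVME_INVALID_FIELD | NVME_DNR;
}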

Changes since v1:
1. I moved the ocp parameter from the namespace to the subsystem as it is
   defined there in the OCP specification
2. I now accumulate statistics from all namespaces and report them back on
   the extended log as per the spec.
3. I removed the default case in the switch in nvme_vendor_specific_log as
   it does not have any special function.

Joel Granados (3):
  nvme: Move adjustment of data_units{read,written}
  nvme: Add ocp to the subsys
  nvme: Add physical writes/reads from OCP log

 hw/nvme/ctrl.c   | 70 
 hw/nvme/nvme.h   |  1 +
 hw/nvme/subsys.c |  4 +--
 include/block/nvme.h | 36 +++
 4 files changed, 103 insertions(+), 8 deletions(-)

-- 
2.30.2




Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-14 Thread Christian Borntraeger




Am 14.11.22 um 17:37 schrieb Michael S. Tsirkin:

On Mon, Nov 14, 2022 at 05:18:53PM +0100, Christian Borntraeger wrote:

Am 08.11.22 um 10:23 schrieb Alex Bennée:

The previous fix to virtio_device_started revealed a problem in its
use by both the core and the device code. The core code should be able
to handle the device "starting" while the VM isn't running to handle
the restoration of migration state. To solve this dual use introduce a
new helper for use by the vhost-user backends who all use it to feed a
should_start variable.

We can also pick up a change vhost_user_blk_set_status while we are at
it which follows the same pattern.

Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to virtio_device_started)
Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
Signed-off-by: Alex Bennée 
Cc: "Michael S. Tsirkin" 


Hmmm, is this
commit 259d69c00b67c02a67f3bdbeeea71c2c0af76c35
Author: Alex Bennée 
AuthorDate: Mon Nov 7 12:14:07 2022 +
Commit: Michael S. Tsirkin 
CommitDate: Mon Nov 7 14:08:18 2022 -0500

 hw/virtio: introduce virtio_device_should_start

an older version?


This is what got merged:
https://lore.kernel.org/r/20221107121407.1010913-1-alex.bennee%40linaro.org
This patch was sent after I merged the RFC.
I think the only difference is the commit log but I might be missing
something.


This does not seem to fix the regression that I have reported.


This was applied on top of 9f6bcfd99f which IIUC does, right?




QEMU master still fails for me for suspend/resume to disk:

#0  0x03ff8e3980a6 in __pthread_kill_implementation () at /lib64/libc.so.6
#1  0x03ff8e348580 in raise () at /lib64/libc.so.6
#2  0x03ff8e32b5c0 in abort () at /lib64/libc.so.6
#3  0x03ff8e3409da in __assert_fail_base () at /lib64/libc.so.6
#4  0x03ff8e340a4e in  () at /lib64/libc.so.6
#5  0x02aa1ffa8966 in vhost_vsock_common_pre_save (opaque=<optimized out>) 
at ../hw/virtio/vhost-vsock-common.c:203
#6  0x02aa1fe5e0ee in vmstate_save_state_v
(f=f@entry=0x2aa21bdc170, vmsd=0x2aa204ac5f0 , 
opaque=0x2aa21bac9f8, vmdesc=vmdesc@entry=0x3fddc08eb30, 
version_id=version_id@entry=0) at ../migration/vmstate.c:329
#7  0x02aa1fe5ebf8 in vmstate_save_state (f=f@entry=0x2aa21bdc170, vmsd=<optimized out>, opaque=<optimized out>, vmdesc_id=vmdesc_id@entry=0x3fddc08eb30) at 
../migration/vmstate.c:317
#8  0x02aa1fe75bd0 in vmstate_save (f=f@entry=0x2aa21bdc170, 
se=se@entry=0x2aa21bdbe90, vmdesc=vmdesc@entry=0x3fddc08eb30) at 
../migration/savevm.c:908
#9  0x02aa1fe79584 in qemu_savevm_state_complete_precopy_non_iterable 
(f=f@entry=0x2aa21bdc170, in_postcopy=in_postcopy@entry=false, 
inactivate_disks=inactivate_disks@entry=true)
at ../migration/savevm.c:1393
#10 0x02aa1fe79a96 in qemu_savevm_state_complete_precopy (f=0x2aa21bdc170, 
iterable_only=iterable_only@entry=false, 
inactivate_disks=inactivate_disks@entry=true) at ../migration/savevm.c:1459
#11 0x02aa1fe6d6ee in migration_completion (s=0x2aa218ef600) at 
../migration/migration.c:3314
#12 migration_iteration_run (s=0x2aa218ef600) at ../migration/migration.c:3761
#13 migration_thread (opaque=opaque@entry=0x2aa218ef600) at 
../migration/migration.c:3989
#14 0x02aa201f0b8c in qemu_thread_start (args=<optimized out>) at 
../util/qemu-thread-posix.c:505
#15 0x03ff8e396248 in start_thread () at /lib64/libc.so.6
#16 0x03ff8e41183e in thread_start () at /lib64/libc.so.6

Michael, your previous branch did work if I recall correctly.



[PULL 1/2] MAINTAINERS: Update maintainer's email for Xilinx CAN

2022-11-14 Thread Peter Maydell
From: Vikram Garhwal 

Signed-off-by: Vikram Garhwal 
Reviewed-by: Francisco Iglesias 
Signed-off-by: Peter Maydell 
---
 MAINTAINERS | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index caba73ec41b..be151f00246 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1748,8 +1748,8 @@ F: tests/qtest/intel-hda-test.c
 F: tests/qtest/fuzz-sb16-test.c
 
 Xilinx CAN
-M: Vikram Garhwal 
-M: Francisco Iglesias 
+M: Vikram Garhwal 
+M: Francisco Iglesias 
 S: Maintained
 F: hw/net/can/xlnx-*
 F: include/hw/net/xlnx-*
-- 
2.25.1




Re: [PATCH v2] capstone: use <capstone/capstone.h> instead of <capstone.h>

2022-11-14 Thread Paolo Bonzini
Queued, thanks.

Paolo




Re: [PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

Polling during bdrv_drained_end() can be problematic (and in the future,
we may get cases for bdrv_drained_begin() where polling is forbidden,
and we don't care about already in-flight requests, but just want to
prevent new requests from arriving).

The .bdrv_drained_begin/end callbacks running in a coroutine is the only
reason why we have to do this polling, so make them non-coroutine
callbacks again. None of the callers actually yield any more.

This means that bdrv_drained_end() effectively doesn't poll any more,
even if AIO_WAIT_WHILE() loops are still there (their condition is false
from the beginning). This is generally not a problem, but in
test-bdrv-drain, some additional explicit aio_poll() calls need to be
added because the test case wants to verify the final state after BHs
have executed.

Signed-off-by: Kevin Wolf 
---
  include/block/block_int-common.h | 10 ---
  block.c  |  4 +--
  block/io.c   | 49 +---
  block/qed.c  |  4 +--
  block/throttle.c |  6 ++--
  tests/unit/test-bdrv-drain.c | 18 ++--
  6 files changed, 30 insertions(+), 61 deletions(-)


As the others have already suggested, I'd also drop the _co_ in qed and
throttle, and the coroutine_fn in throttle.  With that done:


Reviewed-by: Hanna Reitz 




[PATCH v2] hw/intc: sifive_plic: Renumber the S irqs for numa support

2022-11-14 Thread Frédéric Pétrot
Commit 40244040a7a changed the way the S irqs are numbered. This breaks when
using numa configuration, e.g.:
./qemu-system-riscv64 -nographic -machine virt,dumpdtb=numa-tree.dtb \
  -m 2G -smp cpus=16 \
  -object memory-backend-ram,id=mem0,size=512M \
  -object memory-backend-ram,id=mem1,size=512M \
  -object memory-backend-ram,id=mem2,size=512M \
  -object memory-backend-ram,id=mem3,size=512M \
  -numa node,cpus=0-3,memdev=mem0,nodeid=0 \
  -numa node,cpus=4-7,memdev=mem1,nodeid=1 \
  -numa node,cpus=8-11,memdev=mem2,nodeid=2 \
  -numa node,cpus=12-15,memdev=mem3,nodeid=3
leads to:
Unexpected error in object_property_find_err() at ../qom/object.c:1304:
qemu-system-riscv64: Property 'riscv.sifive.plic.unnamed-gpio-out[8]' not
found

This patch makes the numbering of the S irqs identical to what it was before.

Reviewed-by: Alistair Francis 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Frédéric Pétrot 
---
 hw/intc/sifive_plic.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index c2dfacf028..b4949bef97 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -476,11 +476,11 @@ DeviceState *sifive_plic_create(hwaddr addr, char *hart_config,
 CPUState *cpu = qemu_get_cpu(cpu_num);
 
 if (plic->addr_config[i].mode == PLICMode_M) {
-qdev_connect_gpio_out(dev, num_harts - plic->hartid_base + cpu_num,
+qdev_connect_gpio_out(dev, cpu_num - hartid_base + num_harts,
   qdev_get_gpio_in(DEVICE(cpu), IRQ_M_EXT));
 }
 if (plic->addr_config[i].mode == PLICMode_S) {
-qdev_connect_gpio_out(dev, cpu_num,
+qdev_connect_gpio_out(dev, cpu_num - hartid_base,
   qdev_get_gpio_in(DEVICE(cpu), IRQ_S_EXT));
 }
 }
-- 
2.37.2




[PULL 1/2] target/i386: fix cmpxchg with 32-bit register destination

2022-11-14 Thread Richard Henderson
From: Paolo Bonzini 

Unlike the memory case, where "the destination operand receives a write
cycle without regard to the result of the comparison", rm must not be
touched altogether if the write fails, including not zero-extending
it on 64-bit processors.  This is not how the movcond currently works,
because it is always followed by a gen_op_mov_reg_v to rm.

To fix it, introduce a new function that is similar to gen_op_mov_reg_v
but writes to a TCG temporary.

Considering that gen_extu(ot, oldv) is not needed in the memory case
either, the two cases for register and memory destinations are different
enough that one might as well fuse the two "if (mod == 3)" into one.
So do that too.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/508
Signed-off-by: Paolo Bonzini 
[rth: Add a test case ]
Signed-off-by: Richard Henderson 
---
 target/i386/tcg/translate.c  | 82 ++--
 tests/tcg/x86_64/cmpxchg.c   | 42 
 tests/tcg/x86_64/Makefile.target |  1 +
 3 files changed, 99 insertions(+), 26 deletions(-)
 create mode 100644 tests/tcg/x86_64/cmpxchg.c

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 28a4e6dc1d..dbd6492778 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -439,32 +439,51 @@ static inline MemOp mo_b_d32(int b, MemOp ot)
 return b & 1 ? (ot == MO_16 ? MO_16 : MO_32) : MO_8;
 }
 
-static void gen_op_mov_reg_v(DisasContext *s, MemOp ot, int reg, TCGv t0)
+/* Compute the result of writing t0 to the OT-sized register REG.
+ *
+ * If DEST is NULL, store the result into the register and return the
+ * register's TCGv.
+ *
+ * If DEST is not NULL, store the result into DEST and return the
+ * register's TCGv.
+ */
+static TCGv gen_op_deposit_reg_v(DisasContext *s, MemOp ot, int reg, TCGv dest, TCGv t0)
 {
 switch(ot) {
 case MO_8:
-if (!byte_reg_is_xH(s, reg)) {
-tcg_gen_deposit_tl(cpu_regs[reg], cpu_regs[reg], t0, 0, 8);
-} else {
-tcg_gen_deposit_tl(cpu_regs[reg - 4], cpu_regs[reg - 4], t0, 8, 8);
+if (byte_reg_is_xH(s, reg)) {
+dest = dest ? dest : cpu_regs[reg - 4];
+tcg_gen_deposit_tl(dest, cpu_regs[reg - 4], t0, 8, 8);
+return cpu_regs[reg - 4];
 }
+dest = dest ? dest : cpu_regs[reg];
+tcg_gen_deposit_tl(dest, cpu_regs[reg], t0, 0, 8);
 break;
 case MO_16:
-tcg_gen_deposit_tl(cpu_regs[reg], cpu_regs[reg], t0, 0, 16);
+dest = dest ? dest : cpu_regs[reg];
+tcg_gen_deposit_tl(dest, cpu_regs[reg], t0, 0, 16);
 break;
 case MO_32:
 /* For x86_64, this sets the higher half of register to zero.
For i386, this is equivalent to a mov. */
-tcg_gen_ext32u_tl(cpu_regs[reg], t0);
+dest = dest ? dest : cpu_regs[reg];
+tcg_gen_ext32u_tl(dest, t0);
 break;
 #ifdef TARGET_X86_64
 case MO_64:
-tcg_gen_mov_tl(cpu_regs[reg], t0);
+dest = dest ? dest : cpu_regs[reg];
+tcg_gen_mov_tl(dest, t0);
 break;
 #endif
 default:
 tcg_abort();
 }
+return cpu_regs[reg];
+}
+
+static void gen_op_mov_reg_v(DisasContext *s, MemOp ot, int reg, TCGv t0)
+{
+gen_op_deposit_reg_v(s, ot, reg, NULL, t0);
 }
 
 static inline
@@ -3747,7 +3766,7 @@ static bool disas_insn(DisasContext *s, CPUState *cpu)
 case 0x1b0:
 case 0x1b1: /* cmpxchg Ev, Gv */
 {
-TCGv oldv, newv, cmpv;
+TCGv oldv, newv, cmpv, dest;
 
 ot = mo_b_d(b, dflag);
 modrm = x86_ldub_code(env, s);
@@ -3758,7 +3777,7 @@ static bool disas_insn(DisasContext *s, CPUState *cpu)
 cmpv = tcg_temp_new();
 gen_op_mov_v_reg(s, ot, newv, reg);
 tcg_gen_mov_tl(cmpv, cpu_regs[R_EAX]);
-
+gen_extu(ot, cmpv);
 if (s->prefix & PREFIX_LOCK) {
 if (mod == 3) {
 goto illegal_op;
@@ -3766,32 +3785,43 @@ static bool disas_insn(DisasContext *s, CPUState *cpu)
 gen_lea_modrm(env, s, modrm);
 tcg_gen_atomic_cmpxchg_tl(oldv, s->A0, cmpv, newv,
   s->mem_index, ot | MO_LE);
-gen_op_mov_reg_v(s, ot, R_EAX, oldv);
 } else {
 if (mod == 3) {
 rm = (modrm & 7) | REX_B(s);
 gen_op_mov_v_reg(s, ot, oldv, rm);
+gen_extu(ot, oldv);
+
+/*
+ * Unlike the memory case, where "the destination operand receives
+ * a write cycle without regard to the result of the comparison",
+ * rm must not be touched altogether if the write fails, including
+ * not zero-extending it on 64-bit processors.  So, precompute
+ * the result of a successful writeback and perform the movcond
+ 
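
(The archive cuts this hunk off above. For orientation, the semantics the
truncated comment describes come down to the following sketch, my own
C-like illustration rather than the committed TCG code:)

    /* Register-destination CMPXCHG, as the comment above specifies: */
    if (oldv == cmpv) {
        regs[rm] = zero_extend(newv, ot);  /* normal write, incl. extension */
    } else {
        /* regs[rm] stays completely untouched, not even zero-extended */
    }

The patch realizes this by precomputing the successful-writeback value with
gen_op_deposit_reg_v() into a temporary and selecting between it and the old
register contents with a single movcond.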

Re: [PATCH 2/2] Do not access /dev/mem in MSI-X PCI passthrough on Xen

2022-11-14 Thread Andrew Cooper
On 14/11/2022 19:20, Marek Marczykowski-Górecki wrote:
> The /dev/mem is used for two purposes:
>  - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
>  - reading Pending Bit Array (PBA)
>
> The first one was originally done because Xen did not send all
> vector ctrl writes to the device model, so QEMU might have had an outdated
> register value. This has been changed in Xen, so QEMU can now use its
> cached value of the register instead.
>
> The Pending Bit Array (PBA) handling is for the case where it lives on
> the same page as the MSI-X table itself. Xen has been extended to handle
> this case too (as well as other registers that may live on those pages),
> so QEMU handling is not necessary anymore.
>
> Removing /dev/mem access is useful for working within a stubdomain, and
> necessary when the dom0 kernel runs in lockdown mode.
>
> Signed-off-by: Marek Marczykowski-Górecki 

The commit message ought to go further.  Using /dev/mem like this is
buggy anyway, because it is trapped and emulated by Xen in whatever
context Qemu is running.  Qemu cannot get the actual hardware value, and
even if it could, it would be racy with transient operations needing to
mask the vector.

i.e. it's not just nice-to-remove - it's fixing real corner cases.

~Andrew


Re: [PATCH 07/13] block: Don't use subtree drains in bdrv_drop_intermediate()

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

Instead of using a subtree drain from the top node (which also drains
child nodes of base that we're not even interested in), use a normal
drain for base, which automatically drains all of the parents, too.

Signed-off-by: Kevin Wolf 
---
  block.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)


Reviewed-by: Hanna Reitz 




[PATCH v2 5/6] spec: Introduce NBD_FLAG_BLOCK_STATUS_PAYLOAD

2022-11-14 Thread Eric Blake
NBD_CMD_BLOCK_STATUS currently forces the server to reply to all
metacontext ids that the client negotiated via
NBD_OPT_SET_META_CONTEXT.  But since extended headers make it easy for
the client to pass command payloads, we can allow for a client to
negotiate multiple metacontexts up front but express dynamic interest
in varying subsets of those contexts over the life of the connection,
for less wasted effort in responding to NBD_CMD_BLOCK_STATUS.  This
works by having the command payload supply an effect length and a list
of ids the client is currently interested in.

Signed-off-by: Eric Blake 
---
 doc/proto.md | 62 +---
 1 file changed, 49 insertions(+), 13 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index 14af48d..645a736 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -397,17 +397,20 @@ additional bytes of payload are present), or if the flag is absent
 (there is no payload, and *length* instead is an effect length
 describing how much of the image the request operates on).  The
 command `NBD_CMD_WRITE` MUST use the flag `NBD_CMD_FLAG_PAYLOAD_LEN`
-in this mode; while other commands SHOULD avoid the flag if the
-server has not indicated extension suppport for payloads on that
-command.  A server SHOULD initiate hard disconnect if a client sets
-the `NBD_CMD_FLAG_PAYLOAD_LEN` flag and uses a *length* larger than
-a server's advertised or default maximum payload length (capped at
-32 bits by the constraints of `NBD_INFO_BLOCK_SIZE`); in all other
-cases, a server SHOULD gracefully consume *length* bytes of payload
-(even if it then replies with an `NBD_EINVAL` failure because the
-particular command was not expecting a payload), and proceed with
-the next client command.  Thus, only when *length* is used as an
-effective length will it utilize a full 64-bit value.
+in this mode; most other commands omit it, although some like
+`NBD_CMD_BLOCK_STATUS` optionally support the flag in order to allow
+the client to pass additional information in the payload (where the
+command documents what the payload will contain, including the
+possibility of a separate effect length).  A server SHOULD initiate
+hard disconnect if a client sets the `NBD_CMD_FLAG_PAYLOAD_LEN` flag
+and uses a *length* larger than a server's advertised or default
+maximum payload length (capped at 32 bits by the constraints of
+`NBD_INFO_BLOCK_SIZE`); in all other cases, a server SHOULD gracefully
+consume *length* bytes of payload (even if it then replies with an
+`NBD_EINVAL` failure because the particular command was not expecting
+a payload), and proceed with the next client command.  Thus, only when
+*length* is used as an effective length will it utilize a full 64-bit
+value.

  Simple reply message

@@ -1232,6 +1235,19 @@ The field has the following format:
   will be faster than a regular write). Clients MUST NOT set the
   `NBD_CMD_FLAG_FAST_ZERO` request flag unless this transmission flag
   is set.
+- bit 12, `NBD_FLAG_BLOCK_STATUS_PAYLOAD`: Indicates that the server
+  understands the use of the `NBD_CMD_FLAG_PAYLOAD_LEN` flag to
+  `NBD_CMD_BLOCK_STATUS` to allow the client to request that the
+  server filters its response to a specific subset of negotiated
+  metacontext ids passed in via a client payload, rather than the
+  default of replying to all metacontext ids. Servers MUST NOT
+  advertise this bit unless the client successfully negotiates
+  extended headers via `NBD_OPT_EXTENDED_HEADERS`, and SHOULD NOT
+  advertise this bit in response to `NBD_OPT_EXPORT_NAME` or
+  `NBD_OPT_GO` if the client does not negotiate metacontexts with
+  `NBD_OPT_SET_META_CONTEXT`; clients SHOULD NOT set the
+  `NBD_CMD_FLAG_PAYLOAD_LEN` flag for `NBD_CMD_BLOCK_STATUS` unless
+  this transmission flag is set.

 Clients SHOULD ignore unknown flags.

@@ -1915,8 +1931,11 @@ valid may depend on negotiation during the handshake phase.
   header.  With extended headers, the flag MUST be set for
   `NBD_CMD_WRITE` (as the write command always sends a payload of the
   bytes to be written); for other commands, the flag will trigger an
-  `NBD_EINVAL` error unless the server has advertised support for an
-  extension payload form for the command.
+  `NBD_EINVAL` error unless the command documents an optional payload
+  form for the command and the server has implemented that form (an
+  example being `NBD_CMD_BLOCK_STATUS` providing a payload form for
+  restricting the response to a particular metacontext id, when the
+  server advertises `NBD_FLAG_BLOCK_STATUS_PAYLOAD`).

 # Structured reply flags

@@ -2464,6 +2483,23 @@ The following request types exist:
 The server MAY send chunks in a different order than the context
 ids were assigned in reply to `NBD_OPT_SET_META_CONTEXT`.

+If extended headers were negotiated, a server MAY optionally
+advertise, via the transmission flag
+`NBD_FLAG_BLOCK_STATUS_PAYLOAD`, that it supports an alternative
+request 
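
(The hunk is truncated by the archive at this point. For orientation, the
client payload being described, an effect length plus the metacontext ids
currently of interest, would look roughly like this; the struct and field
names are my own shorthand, not the spec's wire-format definitions:)

    struct nbd_block_status_payload {   /* hypothetical names */
        uint64_t effect_length; /* how much of the export the query covers */
        uint32_t id[];          /* subset of negotiated metacontext ids;
                                   count implied by the request's payload
                                   length */
    };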

[PULL v2 00/11] Block layer patches

2022-11-14 Thread Kevin Wolf
The following changes since commit 2ccad61746ca7de5dd3e25146062264387e43bd4:

  Merge tag 'pull-tcg-20221109' of https://gitlab.com/rth7680/qemu into staging 
(2022-11-09 13:26:45 -0500)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 46530d3560b6fd5de38b6f4e45ce7d7135b9add7:

  tests/stream-under-throttle: New test (2022-11-14 11:31:52 +0100)


Block layer patches

- Fix deadlock in graph modification with iothreads
- mirror: Fix non-converging cases for active mirror
- qapi: Fix BlockdevOptionsNvmeIoUring @path description
- blkio: Set BlockDriver::has_variable_length to false


Alberto Faria (2):
  qapi/block-core: Fix BlockdevOptionsNvmeIoUring @path description
  block/blkio: Set BlockDriver::has_variable_length to false

Hanna Reitz (9):
  block/mirror: Do not wait for active writes
  block/mirror: Drop mirror_wait_for_any_operation()
  block/mirror: Fix NULL s->job in active writes
  iotests/151: Test that active mirror progresses
  iotests/151: Test active requests on mirror start
  block: Make bdrv_child_get_parent_aio_context I/O
  block-backend: Update ctx immediately after root
  block: Start/end drain on correct AioContext
  tests/stream-under-throttle: New test

 qapi/block-core.json   |   2 +-
 include/block/block-global-state.h |   1 -
 include/block/block-io.h   |   2 +
 include/block/block_int-common.h   |   4 +-
 block.c|   2 +-
 block/blkio.c  |   1 -
 block/block-backend.c  |   9 +-
 block/io.c |   6 +-
 block/mirror.c |  78 ---
 blockjob.c |   3 +-
 tests/qemu-iotests/151 | 227 -
 tests/qemu-iotests/151.out |   4 +-
 tests/qemu-iotests/tests/stream-under-throttle | 121 +++
 tests/qemu-iotests/tests/stream-under-throttle.out |   5 +
 14 files changed, 424 insertions(+), 41 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/stream-under-throttle
 create mode 100644 tests/qemu-iotests/tests/stream-under-throttle.out




Re: [PATCH] hw/sd: Fix sun4i allwinner-sdhost for U-Boot

2022-11-14 Thread Strahinja Jankovic
On Mon, Nov 14, 2022 at 6:36 PM Peter Maydell  wrote:
>
> On Mon, 14 Nov 2022 at 17:29, Strahinja Jankovic
>  wrote:
> >
> > Hi,
> >
> > Thank you for your reply.
> >
> > On Mon, Nov 14, 2022 at 4:42 PM Peter Maydell  
> > wrote:
> > >
> > > On Sat, 12 Nov 2022 at 21:49, Strahinja Jankovic
> > >  wrote:
> > > >
> > > > Trying to run U-Boot for Cubieboard (Allwinner A10) fails because it 
> > > > cannot
> > > > access SD card. The problem is that FIFO register in current
> > > > allwinner-sdhost implementation is at the address corresponding to
> > > > Allwinner H3, but not A10.
> > > > Linux kernel is not affected since Linux driver uses DMA access and does
> > > > not use FIFO register for reading/writing.
> > > >
> > > > This patch adds new class parameter `is_sun4i` and based on that
> > > > parameter uses register at offset 0x100 either as FIFO register (if
> > > > sun4i) or as threshold register (if not sun4i; in this case register at
> > > > 0x200 is FIFO register).
> > > >
> > > > Tested with U-Boot and Linux kernel image built for Cubieboard and
> > > > OrangePi PC.
> > > >
> > > > Signed-off-by: Strahinja Jankovic 
> > > > ---
> > > >  hw/sd/allwinner-sdhost.c | 67 ++--
> > > >  include/hw/sd/allwinner-sdhost.h |  1 +
> > > >  2 files changed, 47 insertions(+), 21 deletions(-)
> > > >
> > > > diff --git a/hw/sd/allwinner-sdhost.c b/hw/sd/allwinner-sdhost.c
> > > > index 455d6eabf6..51e5e90830 100644
> > > > --- a/hw/sd/allwinner-sdhost.c
> > > > +++ b/hw/sd/allwinner-sdhost.c
> > > > @@ -65,7 +65,7 @@ enum {
> > > >  REG_SD_DLBA   = 0x84,  /* Descriptor List Base Address */
> > > >  REG_SD_IDST   = 0x88,  /* Internal DMA Controller Status */
> > > >  REG_SD_IDIE   = 0x8C,  /* Internal DMA Controller IRQ Enable */
> > > > -REG_SD_THLDC  = 0x100, /* Card Threshold Control */
> > > > +REG_SD_THLDC  = 0x100, /* Card Threshold Control / FIFO (sun4i only)*/
> > > >  REG_SD_DSBD   = 0x10C, /* eMMC DDR Start Bit Detection Control */
> > > >  REG_SD_RES_CRC= 0x110, /* Response CRC from card/eMMC */
> > > >  REG_SD_DATA7_CRC  = 0x114, /* CRC Data 7 from card/eMMC */
> > > > @@ -415,10 +415,29 @@ static void allwinner_sdhost_dma(AwSdHostState *s)
> > > >  }
> > > >  }
> > > >
> > > > +static uint32_t allwinner_sdhost_fifo_read(AwSdHostState *s)
> > > > +{
> > > > +uint32_t res = 0;
> > > > +
> > > > +if (sdbus_data_ready(&s->sdbus)) {
> > > > +sdbus_read_data(&s->sdbus, &res, sizeof(uint32_t));
> > > > +le32_to_cpus(&res);
> > > > +allwinner_sdhost_update_transfer_cnt(s, sizeof(uint32_t));
> > > > +allwinner_sdhost_auto_stop(s);
> > > > +allwinner_sdhost_update_irq(s);
> > > > +} else {
> > > > +qemu_log_mask(LOG_GUEST_ERROR, "%s: no data ready on SD bus\n",
> > > > +  __func__);
> > > > +}
> > > > +
> > > > +return res;
> > > > +}
> > > > +
> > > >  static uint64_t allwinner_sdhost_read(void *opaque, hwaddr offset,
> > > >unsigned size)
> > > >  {
> > > >  AwSdHostState *s = AW_SDHOST(opaque);
> > > > +AwSdHostClass *sc = AW_SDHOST_GET_CLASS(s);
> > > >  uint32_t res = 0;
> > > >
> > > >  switch (offset) {
> > > > @@ -508,8 +527,12 @@ static uint64_t allwinner_sdhost_read(void *opaque, hwaddr offset,
> > > >  case REG_SD_IDIE:  /* Internal DMA Controller Interrupt Enable */
> > > >  res = s->dmac_irq;
> > > >  break;
> > > > -case REG_SD_THLDC: /* Card Threshold Control */
> > > > -res = s->card_threshold;
> > > > +case REG_SD_THLDC: /* Card Threshold Control or FIFO register (sun4i) */
> > > > +if (sc->is_sun4i) {
> > > > +res = allwinner_sdhost_fifo_read(s);
> > > > +} else {
> > > > +res = s->card_threshold;
> > > > +}
> > > >  break;
> > > >  case REG_SD_DSBD:  /* eMMC DDR Start Bit Detection Control */
> > > >  res = s->startbit_detect;
> > > > @@ -531,16 +554,7 @@ static uint64_t allwinner_sdhost_read(void *opaque, hwaddr offset,
> > > >  res = s->status_crc;
> > > >  break;
> > > >  case REG_SD_FIFO:  /* Read/Write FIFO */
> > > > -if (sdbus_data_ready(&s->sdbus)) {
> > > > -sdbus_read_data(&s->sdbus, &res, sizeof(uint32_t));
> > > > -le32_to_cpus(&res);
> > > > -allwinner_sdhost_update_transfer_cnt(s, sizeof(uint32_t));
> > > > -allwinner_sdhost_auto_stop(s);
> > > > -allwinner_sdhost_update_irq(s);
> > > > -} else {
> > > > -qemu_log_mask(LOG_GUEST_ERROR, "%s: no data ready on SD bus\n",
> > > > -  __func__);
> > > > -}
> > > > +res = allwinner_sdhost_fifo_read(s);
> > >
> > > Does the sun4i really have the FIFO at both addresses, or should
> > > this one 

[PATCH v2 3/3] nvme: Add physical writes/reads from OCP log

2022-11-14 Thread Joel Granados
In order to evaluate write amplification factor (WAF) within the storage
stack it is important to know the number of bytes written to the
controller. The existing SMART log value of Data Units Written is too
coarse (given in units of 500 Kb) and so we add the SMART health
information extended from the OCP specification (given in units of bytes).
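
Spelling the quantity out for clarity (this is the standard definition, not
text from the spec): WAF = physical media bytes written by the controller
(physical_media_units_written) divided by bytes written by the host over the
same interval, a ratio that only becomes meaningful once both counters are
byte-granular.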

To accommodate different vendor-specific specifications like OCP, we add a
multiplexing function (nvme_vendor_specific_log) which will route to the
different log functions based on arguments and log ids. We only return the
OCP extended smart log when the command is 0xC0 and ocp has been turned on
in the args.

Though we add the whole nvme smart log extended structure, we only populate
the physical_media_units_{read,written}, log_page_version and
log_page_uuid.

Signed-off-by: Joel Granados 
---
 hw/nvme/ctrl.c   | 56 
 include/block/nvme.h | 36 
 2 files changed, 92 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 220683201a..5e6a8150a2 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4455,6 +4455,42 @@ static void nvme_set_blk_stats(NvmeNamespace *ns, struct nvme_stats *stats)
 stats->write_commands += s->nr_ops[BLOCK_ACCT_WRITE];
 }
 
+static uint16_t nvme_ocp_extended_smart_info(NvmeCtrl *n, uint8_t rae,
+ uint32_t buf_len, uint64_t off,
+ NvmeRequest *req)
+{
+NvmeNamespace *ns = NULL;
+NvmeSmartLogExtended smart_ext = { 0 };
+struct nvme_stats stats = { 0 };
+uint32_t trans_len;
+
+if (off >= sizeof(smart_ext)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+// Accumulate all stats from all namespaces
+for (int i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+ns = nvme_ns(n, i);
+if (ns)
+{
+nvme_set_blk_stats(ns, &stats);
+}
+}
+
+smart_ext.physical_media_units_written[0] = cpu_to_le32(stats.units_written);
+smart_ext.physical_media_units_read[0] = cpu_to_le32(stats.units_read);
+smart_ext.log_page_version = 0x0003;
+smart_ext.log_page_uuid[0] = 0xA4F2BFEA2810AFC5;
+smart_ext.log_page_uuid[1] = 0xAFD514C97C6F4F9C;
+
+if (!rae) {
+nvme_clear_events(n, NVME_AER_TYPE_SMART);
+}
+
+trans_len = MIN(sizeof(smart_ext) - off, buf_len);
+return nvme_c2h(n, (uint8_t *) &smart_ext + off, trans_len, req);
+}
+
 static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
 uint64_t off, NvmeRequest *req)
 {
@@ -4642,6 +4678,24 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint8_t csi, uint32_t buf_len,
 return nvme_c2h(n, ((uint8_t *)&log) + off, trans_len, req);
 }
 
+static uint16_t nvme_vendor_specific_log(uint8_t lid, NvmeCtrl *n, uint8_t rae,
+ uint32_t buf_len, uint64_t off,
+ NvmeRequest *req)
+{
+NvmeSubsystem *subsys = n->subsys;
+switch (lid) {
+case NVME_LOG_VENDOR_START:
+if (subsys->params.ocp) {
+return nvme_ocp_extended_smart_info(n, rae, buf_len, off, req);
+}
+break;
+/* Add a case for each additional vendor specific log id */
+}
+
+trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
 static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
 {
 NvmeCmd *cmd = &req->cmd;
@@ -4683,6 +4737,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
 return nvme_error_info(n, rae, len, off, req);
 case NVME_LOG_SMART_INFO:
 return nvme_smart_info(n, rae, len, off, req);
+case NVME_LOG_VENDOR_START...NVME_LOG_VENDOR_END:
+return nvme_vendor_specific_log(lid, n, rae, len, off, req);
 case NVME_LOG_FW_SLOT_INFO:
 return nvme_fw_log_info(n, len, off, req);
 case NVME_LOG_CHANGED_NSLIST:
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 8027b7126b..2ab0dca529 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -978,6 +978,40 @@ typedef struct QEMU_PACKED NvmeSmartLog {
 uint8_t reserved2[320];
 } NvmeSmartLog;
 
+typedef struct QEMU_PACKED NvmeSmartLogExtended {
+uint64_t    physical_media_units_written[2];
+uint64_t    physical_media_units_read[2];
+uint64_t    bad_user_blocks;
+uint64_t    bad_system_nand_blocks;
+uint64_t    xor_recovery_count;
+uint64_t    uncorrectable_read_error_count;
+uint64_t    soft_ecc_error_count;
+uint64_t    end2end_correction_counts;
+uint8_t     system_data_percent_used;
+uint8_t     refresh_counts[7];
+uint64_t    user_data_erase_counts;
+uint16_t    thermal_throttling_stat_and_count;
+uint16_t    dssd_spec_version[3];
+uint64_t    pcie_correctable_error_count;
+uint32_t   

Re: [PATCH 3/4] hw/cxl: Multi-Region CXL Type-3 Devices (Volatile and Persistent)

2022-11-14 Thread Jonathan Cameron via
On Tue, 25 Oct 2022 20:47:36 -0400
Gregory Price  wrote:

> This commit enables each CXL Type-3 device to contain one volatile
> memory region and one persistent region.
> 
> Two new properties have been added to cxl-type3 device initialization:
> [volatile-memdev] and [persistent-memdev]
> 
> The existing [memdev] property has been deprecated and will default the
> memory region to a persistent memory region (although a user may assign
> the region to a ram or file backed region). It cannot be used in
> combination with the new [persistent-memdev] property.
> 
> Partitioning volatile memory from persistent memory is not yet supported.
> 
> Volatile memory is mapped at DPA(0x0), while Persistent memory is mapped
> at DPA(vmem->size), per CXL Spec 8.2.9.8.2.0 - Get Partition Info.
> 
> Signed-off-by: Gregory Price 

Hi Gregory,

I've not been rushing on this purely because we are after the feature
freeze for this QEMU cycle so no great rush to line up new features
(and there was some fun with the pull request the previous set of QEMU
CXL features were in - unrelated to those patches).

A few comments inline.

Once I've chased down a segfault on power down of my qemu setup (that
seems to have nothing to do with the CXL support. *sigh*) I'll push out
an updated tree with this on it for testing purposes.

Thanks,

Jonathan

> ---
>  docs/system/devices/cxl.rst |  53 +--
>  hw/cxl/cxl-mailbox-utils.c  |  21 ++-
>  hw/mem/cxl_type3.c  | 274 +++-
>  include/hw/cxl/cxl_device.h |  11 +-
>  tests/qtest/cxl-test.c  | 111 +++
>  5 files changed, 348 insertions(+), 122 deletions(-)
> 
> diff --git a/docs/system/devices/cxl.rst b/docs/system/devices/cxl.rst
> index f25783a4ec..9e165064c8 100644
> --- a/docs/system/devices/cxl.rst
> +++ b/docs/system/devices/cxl.rst
> @@ -300,15 +300,36 @@ Example topology involving a switch::
>  
>  Example command lines
>  -
> -A very simple setup with just one directly attached CXL Type 3 device::
> +A very simple setup with just one directly attached CXL Type 3 Persistent 
> Memory device::
>  
>qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 
> -cpu max \
>...
> -  -object 
> memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest.raw,size=256M \
> -  -object 
> memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \
> +  -object 
> memory-backend-file,pmem=true,id=pmem0,share=on,mem-path=/tmp/cxltest.raw,size=256M
>  \
> +  -object 
> memory-backend-file,pmem=true,id=cxl-lsa0,share=on,mem-path=/tmp/lsa.raw,size=256M
>  \

Why make the pmem=true change in here? If we want to do that I think it should
be in a separate patch with explanation.

> +  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> +  -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> +  -device 
> cxl-type3,bus=root_port13,persistent-memdev=pmem0,lsa=cxl-lsa1,id=cxl-pmem0 \
> +  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
> +
> +A very simple setup with just one directly attached CXL Type 3 Volatile 
> Memory device::
> +
> +  qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 
> -cpu max \
> +  ...
> +  -object memory-backend-ram,id=vmem0,share=on,size=256M \
>-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
>-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> -  -device 
> cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> +  -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> +  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G

...


> diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> index d7543fd5b4..c1183614b9 100644
> --- a/hw/cxl/cxl-mailbox-utils.c
> +++ b/hw/cxl/cxl-mailbox-utils.c
> @@ -269,7 +269,7 @@ static ret_code cmd_firmware_update_get_info(struct 
> cxl_cmd *cmd,
>  } QEMU_PACKED *fw_info;
>  QEMU_BUILD_BUG_ON(sizeof(*fw_info) != 0x50);
>  
> -if (cxl_dstate->pmem_size < CXL_CAPACITY_MULTIPLIER) {
> +if (cxl_dstate->mem_size < CXL_CAPACITY_MULTIPLIER) {

I think this is also true of the individual types. Should we check each of them
instead, much like we do below for cmd_identify_memory_device?

>  return CXL_MBOX_INTERNAL_ERROR;
>  }
>  
> @@ -412,20 +412,20 @@ static ret_code cmd_identify_memory_device(struct 
> cxl_cmd *cmd,
>  
>  CXLType3Dev *ct3d = container_of(cxl_dstate, CXLType3Dev, cxl_dstate);
>  CXLType3Class *cvc = CXL_TYPE3_GET_CLASS(ct3d);
> -uint64_t size = cxl_dstate->pmem_size;
>  
> -if (!QEMU_IS_ALIGNED(size, CXL_CAPACITY_MULTIPLIER)) {
> +if ((!QEMU_IS_ALIGNED(cxl_dstate->vmem_size, CXL_CAPACITY_MULTIPLIER)) ||
> +(!QEMU_IS_ALIGNED(cxl_dstate->pmem_size, CXL_CAPACITY_MULTIPLIER))) {
>  return CXL_MBOX_INTERNAL_ERROR;
>  }
...


> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index 0317bd96a6..21e866dcaf 100644
> --- a/hw/mem/cxl_type3.c

Ugly QOM property names: paths within paths

2022-11-14 Thread Markus Armbruster
I noticed this the other day:

(qemu) info qom-tree 
/machine (pc-i440fx-7.2-machine)
  /fw_cfg (fw_cfg_io)
/\x2from@etc\x2facpi\x2frsdp[0] (memory-region)
/\x2from@etc\x2facpi\x2ftables[0] (memory-region)
/\x2from@etc\x2ftable-loader[0] (memory-region)
/fwcfg.dma[0] (memory-region)
/fwcfg[0] (memory-region)
[...]

It took me a minute to realize that the "\x2f" sequences in these property
names are escaped forms of '/'.  I.e. the unescaped path components of the
first property path are

machine
fw_cfg
/rom@etc/acpi/rsdp[0]

We're embedding paths within paths.  Ugh!
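To make the escapery concrete: '/' is byte 0x2f, so a region named
"/rom@etc/acpi/rsdp" becomes the child property name
"\x2from@etc\x2facpi\x2frsdp" per the code quoted below (each '/' turns
into the four characters '\', 'x', '2', 'f', whose trailing 'f' then reads
as if glued onto the next component).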

The escaping happens in memory_region_init():

static bool memory_region_need_escape(char c)
{
return c == '/' || c == '[' || c == '\\' || c == ']';
}

static char *memory_region_escape_name(const char *name)
{
const char *p;
char *escaped, *q;
uint8_t c;
size_t bytes = 0;

for (p = name; *p; p++) {
bytes += memory_region_need_escape(*p) ? 4 : 1;
}
if (bytes == p - name) {
   return g_memdup(name, bytes + 1);
}

escaped = g_malloc(bytes + 1);
for (p = name, q = escaped; *p; p++) {
c = *p;
if (unlikely(memory_region_need_escape(c))) {
*q++ = '\\';
*q++ = 'x';
*q++ = "0123456789abcdef"[c >> 4];
c = "0123456789abcdef"[c & 15];
}
*q++ = c;
}
*q = 0;
return escaped;
}

static void memory_region_do_init(MemoryRegion *mr,
  Object *owner,
  const char *name,
  uint64_t size)
{
mr->size = int128_make64(size);
if (size == UINT64_MAX) {
mr->size = int128_2_64();
}
mr->name = g_strdup(name);
mr->owner = owner;
mr->ram_block = NULL;

if (name) {
char *escaped_name = memory_region_escape_name(name);
char *name_array = g_strdup_printf("%s[*]", escaped_name);

if (!owner) {
owner = container_get(qdev_get_machine(), "/unattached");
}

object_property_add_child(owner, name_array, OBJECT(mr));
object_unref(OBJECT(mr));
g_free(name_array);
g_free(escaped_name);
}
}

void memory_region_init(MemoryRegion *mr,
Object *owner,
const char *name,
uint64_t size)
{
object_initialize(mr, sizeof(*mr), TYPE_MEMORY_REGION);
memory_region_do_init(mr, owner, name, size);
}

Goes back to

commit b4fefef9d52003b6d09866501275a9a57995c6b0
Author: Peter Crosthwaite 
Date:   Thu Jun 5 23:15:52 2014 -0700

memory: MemoryRegion: QOMify

QOMify memory regions as an Object. The former init() and destroy()
routines become instance_init() and instance_finalize() resp.

memory_region_init() is re-implemented to be:
object_initialize() + set fields

memory_region_destroy() is re-implemented to call unparent().

Signed-off-by: Peter Crosthwaite 
[Add newly-created MR as child, unparent on destruction. - Paolo]
Signed-off-by: Paolo Bonzini 

No mention of the escapery.

Questions:

1. Do we really want to embed slash-separated paths into slash-separated
   paths?

2. As far as I can tell, object.c does not guard against "funny"
   characters such as '/' in path components.  Should it?  For what it's
   worth, the kernel doesn't permit '/' in filenames.

3. Should the escapery live in object.c instead of memory.c?




Re: [PATCH v2] capstone: use <capstone/capstone.h> instead of <capstone.h>

2022-11-14 Thread Peter Maydell
On Sun, 13 Nov 2022 at 20:10, Michael Tokarev  wrote:
>
> The upcoming capstone 5.0 drops support for the old way
> of including its header, due to this change:
> https://github.com/capstone-engine/capstone/commit/6656bcb63ab4e87dc6079bd6b6b12cc8dd9b2ad8
> The official way is to use <capstone/capstone.h>
>
> This change has already been proposed before, see
> https://patchwork.kernel.org/project/qemu-devel/patch/20180215173539.11033-1-f4...@amsat.org/
> but it didn't find its way into qemu at that time.
>
> On current systems, using <capstone/capstone.h> works
> now (despite the pkg-config-supplied -I/usr/include/capstone) -
> since on all systems capstone headers are put into capstone/
> subdirectory of a system include dir. So this change is
> compatible with both the obsolete way of including it
> and the only future way.

That's only true if capstone happened to be installed
into a system include directory subdirectory. That
is probably true for most distros, but it isn't
necessarily true when an end user has built and
installed capstone locally themselves.

In other words, this is a breaking non-back-compatible
change by capstone upstream, which we now need to work
around somehow :-(


> diff --git a/include/disas/capstone.h b/include/disas/capstone.h
> index e29068dd97..d8fdc5d537 100644
> --- a/include/disas/capstone.h
> +++ b/include/disas/capstone.h
> @@ -3,7 +3,7 @@
>
>  #ifdef CONFIG_CAPSTONE
>
> -#include <capstone.h>
> +#include <capstone/capstone.h>
>
>  #else
>
> diff --git a/meson.build b/meson.build
> index cf3e517e56..6f34c963f7 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -2680,12 +2680,7 @@ if not get_option('capstone').auto() or have_system or 
> have_user
>capstone = dependency('capstone', version: '>=3.0.5',
>  kwargs: static_kwargs, method: 'pkg-config',
>  required: get_option('capstone'))
> -
> -  # Some versions of capstone have broken pkg-config file
> -  # that reports a wrong -I path, causing the #include to
> -  # fail later. If the system has such a broken version
> -  # do not use it.
> -  if capstone.found() and not cc.compiles('#include <capstone.h>',
> +  if capstone.found() and not cc.compiles('#include <capstone/capstone.h>',
>dependencies: [capstone])
>  capstone = not_found
>  if get_option('capstone').enabled()

We can do something like

config_host_data.set('HAVE_CAPSTONE_CAPSTONE_H',
cc.has_header('capstone/capstone.h', dependencies: [capstone]))

to check that this capstone really does have capstone/capstone.h,
for instance.

Dan: is there a reason why in commit 8f4aea712ffc4 you wrote
the "check that capstone.h really exists" check with cc.compiles
rather than cc.has_header ?

thanks
-- PMM



Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-14 Thread Michael S. Tsirkin
On Mon, Nov 14, 2022 at 06:15:30PM +0100, Christian Borntraeger wrote:
> 
> 
> Am 14.11.22 um 18:10 schrieb Michael S. Tsirkin:
> > On Mon, Nov 14, 2022 at 05:55:09PM +0100, Christian Borntraeger wrote:
> > > 
> > > 
> > > Am 14.11.22 um 17:37 schrieb Michael S. Tsirkin:
> > > > On Mon, Nov 14, 2022 at 05:18:53PM +0100, Christian Borntraeger wrote:
> > > > > Am 08.11.22 um 10:23 schrieb Alex Bennée:
> > > > > > The previous fix to virtio_device_started revealed a problem in its
> > > > > > use by both the core and the device code. The core code should be 
> > > > > > able
> > > > > > to handle the device "starting" while the VM isn't running to handle
> > > > > > the restoration of migration state. To solve this dual use 
> > > > > > introduce a
> > > > > > new helper for use by the vhost-user backends who all use it to 
> > > > > > feed a
> > > > > > should_start variable.
> > > > > > 
> > > > > > We can also pick up a change vhost_user_blk_set_status while we are 
> > > > > > at
> > > > > > it which follows the same pattern.
> > > > > > 
> > > > > > Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to 
> > > > > > virtio_device_started)
> > > > > > Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio 
> > > > > > device)
> > > > > > Signed-off-by: Alex Bennée 
> > > > > > Cc: "Michael S. Tsirkin" 
> > > > > 
> > > > > Hmmm, is this
> > > > > commit 259d69c00b67c02a67f3bdbeeea71c2c0af76c35
> > > > > Author: Alex Bennée 
> > > > > AuthorDate: Mon Nov 7 12:14:07 2022 +
> > > > > Commit: Michael S. Tsirkin 
> > > > > CommitDate: Mon Nov 7 14:08:18 2022 -0500
> > > > > 
> > > > >   hw/virtio: introduce virtio_device_should_start
> > > > > 
> > > > > an older version?
> > > > 
> > > > This is what got merged:
> > > > https://lore.kernel.org/r/20221107121407.1010913-1-alex.bennee%40linaro.org
> > > > This patch was sent after I merged the RFC.
> > > > I think the only difference is the commit log but I might be missing
> > > > something.
> > > > 
> > > > > This does not seem to fix the regression that I have reported.
> > > > 
> > > > This was applied on top of 9f6bcfd99f which IIUC does, right?
> > > > 
> > > > 
> > > 
> > > QEMU master still fails for me for suspend/resume to disk:
> > > 
> > > #0  0x03ff8e3980a6 in __pthread_kill_implementation () at 
> > > /lib64/libc.so.6
> > > #1  0x03ff8e348580 in raise () at /lib64/libc.so.6
> > > #2  0x03ff8e32b5c0 in abort () at /lib64/libc.so.6
> > > #3  0x03ff8e3409da in __assert_fail_base () at /lib64/libc.so.6
> > > #4  0x03ff8e340a4e in  () at /lib64/libc.so.6
> > > #5  0x02aa1ffa8966 in vhost_vsock_common_pre_save (opaque=<optimized out>) at ../hw/virtio/vhost-vsock-common.c:203
> > > #6  0x02aa1fe5e0ee in vmstate_save_state_v
> > >  (f=f@entry=0x2aa21bdc170, vmsd=0x2aa204ac5f0 
> > > , opaque=0x2aa21bac9f8, 
> > > vmdesc=vmdesc@entry=0x3fddc08eb30, version_id=version_id@entry=0) at 
> > > ../migration/vmstate.c:329
> > > #7  0x02aa1fe5ebf8 in vmstate_save_state (f=f@entry=0x2aa21bdc170, 
> > > vmsd=, opaque=, 
> > > vmdesc_id=vmdesc_id@entry=0x3fddc08eb30) at ../migration/vmstate.c:317
> > > #8  0x02aa1fe75bd0 in vmstate_save (f=f@entry=0x2aa21bdc170, 
> > > se=se@entry=0x2aa21bdbe90, vmdesc=vmdesc@entry=0x3fddc08eb30) at 
> > > ../migration/savevm.c:908
> > > #9  0x02aa1fe79584 in qemu_savevm_state_complete_precopy_non_iterable 
> > > (f=f@entry=0x2aa21bdc170, in_postcopy=in_postcopy@entry=false, 
> > > inactivate_disks=inactivate_disks@entry=true)
> > >  at ../migration/savevm.c:1393
> > > #10 0x02aa1fe79a96 in qemu_savevm_state_complete_precopy 
> > > (f=0x2aa21bdc170, iterable_only=iterable_only@entry=false, 
> > > inactivate_disks=inactivate_disks@entry=true) at 
> > > ../migration/savevm.c:1459
> > > #11 0x02aa1fe6d6ee in migration_completion (s=0x2aa218ef600) at 
> > > ../migration/migration.c:3314
> > > #12 migration_iteration_run (s=0x2aa218ef600) at 
> > > ../migration/migration.c:3761
> > > #13 migration_thread (opaque=opaque@entry=0x2aa218ef600) at 
> > > ../migration/migration.c:3989
> > > #14 0x02aa201f0b8c in qemu_thread_start (args=<optimized out>) at 
> > > ../util/qemu-thread-posix.c:505
> > > #15 0x03ff8e396248 in start_thread () at /lib64/libc.so.6
> > > #16 0x03ff8e41183e in thread_start () at /lib64/libc.so.6
> > > 
> > > Michael, your previous branch did work if I recall correctly.
> > 
> > That one was failing under github CI though (for reasons we didn't
> > really address, such as disconnect during stop causing a recursive
> > call to stop, but there you are).
> Even the double revert of everything?

I don't remember at this point.

> So how do we proceed now?

I'm hopeful Alex will come up with a fix.

-- 
MST
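
For readers following along, the idea of the helper under discussion is
roughly the following (a paraphrased sketch, not the committed code; see
commit 259d69c00b67 for the authoritative version):

    static inline bool virtio_device_should_start(VirtIODevice *vdev,
                                                  uint8_t status)
    {
        if (!vdev->vm_running) {
            return false;  /* e.g. while restoring migration state */
        }
        return virtio_device_started(vdev, status);
    }

i.e. the vhost-user backends derive their should_start variable from this,
while the core can still consider a device "started" for migration-restore
purposes.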




[PATCH v2 1/3] nvme: Move adjustment of data_units{read,written}

2022-11-14 Thread Joel Granados
In order to return the units_{read/written} required by the SMART log we
need to shift the number-of-bytes value by BDRV_SECTOR_BITS and divide
by 1000. This is a prep patch that moves this adjustment to where the SMART
log is calculated in order to use the stats struct for calculating OCP
extended smart log values.

Signed-off-by: Joel Granados 
---
 hw/nvme/ctrl.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 87aeba0564..220683201a 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4449,8 +4449,8 @@ static void nvme_set_blk_stats(NvmeNamespace *ns, struct nvme_stats *stats)
 {
 BlockAcctStats *s = blk_get_stats(ns->blkconf.blk);
 
-stats->units_read += s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
-stats->units_written += s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
+stats->units_read += s->nr_bytes[BLOCK_ACCT_READ];
+stats->units_written += s->nr_bytes[BLOCK_ACCT_WRITE];
 stats->read_commands += s->nr_ops[BLOCK_ACCT_READ];
 stats->write_commands += s->nr_ops[BLOCK_ACCT_WRITE];
 }
@@ -4490,10 +4490,12 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
 trans_len = MIN(sizeof(smart) - off, buf_len);
 smart.critical_warning = n->smart_critical_warning;
 
-smart.data_units_read[0] = cpu_to_le64(DIV_ROUND_UP(stats.units_read,
-1000));
-smart.data_units_written[0] = cpu_to_le64(DIV_ROUND_UP(stats.units_written,
-   1000));
+smart.data_units_read[0] = cpu_to_le64(DIV_ROUND_UP(
+   stats.units_read >> BDRV_SECTOR_BITS,
+   1000));
+smart.data_units_written[0] = cpu_to_le64(DIV_ROUND_UP(
+  stats.units_written >> BDRV_SECTOR_BITS,
+  1000));
 smart.host_read_commands[0] = cpu_to_le64(stats.read_commands);
 smart.host_write_commands[0] = cpu_to_le64(stats.write_commands);
 
-- 
2.30.2
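
As a concrete check of the adjustment (my numbers, not from the patch):
1 GiB of host writes is 1073741824 bytes; shifting by BDRV_SECTOR_BITS (9)
gives 2097152 512-byte sectors, and DIV_ROUND_UP(2097152, 1000) = 2098 data
units, i.e. one data unit per 1000 sectors (500 KB), matching the SMART
definition.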




Re: [PATCH for 7.2-rc1 v2 00/12] testing, docs, plugins, arm pre-PR

2022-11-14 Thread Alex Bennée


Alex Bennée  writes:

> Hi,
>
> This is my pre-PR series for the pull request I'm going to send on
> Monday in time for Tuesday tagging of rc1. Anything not reviewed will
> get dropped from the PR (which probably includes the GICD_IIDR which
> was just an annoyance I noticed while debugging Xen for another
> series). The following still need review:
>
>  - hw/intc: add implementation of GICD_IIDR to Arm GIC
>  - gitlab: integrate coverage report (1 acks, 1 sobs)
>  - tests/plugins: add a new vcpu state tracking plugin
>  - tests/docker: allow user to override check target

Any more before I roll the PR?

-- 
Alex Bennée



[PATCH 1/1] hw/arm/aspeed: Automatically zero-extend flash images

2022-11-14 Thread Peter Delevoryas
One thing that's really annoying about the Aspeed machines is that you
have to provide a flash image that is the same size as the SPI-NOR flash
device the board uses, or something larger. If you don't, you'll get
this obscure error message:

qemu-system-aarch64: failed to read the initial flash content

Which is just because the file isn't the right size. Zero-extending the
file to 128MB (the largest SPI-NOR flash size) fixes this.

This commit just performs the zero-extend automatically, so people don't
have to maintain bash scripts for it. And if your bash script does this
already, it will be a no-op. And, if your firmware image is larger than
the SPI-NOR device, then it will not truncate it.

Signed-off-by: Peter Delevoryas 
---
 hw/arm/aspeed.c | 40 +++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c
index 55f114ef72..26450d90db 100644
--- a/hw/arm/aspeed.c
+++ b/hw/arm/aspeed.c
@@ -260,6 +260,30 @@ static void write_boot_rom(DriveInfo *dinfo, hwaddr addr, size_t rom_size,
 rom_add_blob_fixed("aspeed.boot_rom", storage, rom_size, addr);
 }
 
+static int zero_extend_block_device(BlockBackend *blk, uint64_t size,
+Error **errp)
+{
+uint64_t perm, shared_perm;
+
+blk_get_perm(blk, &perm, &shared_perm);
+
+if (blk_set_perm(blk, BLK_PERM_ALL, BLK_PERM_ALL, errp)) {
+error_append_hint(errp, "Unable to change permissions on block device");
+return -1;
+}
+if (blk_truncate(blk, size, true, PREALLOC_MODE_OFF, BDRV_REQ_ZERO_WRITE,
+ errp)) {
+error_append_hint(errp, "Unable to zero-extend block device");
+return -1;
+}
+if (blk_set_perm(blk, perm, shared_perm, errp)) {
+error_append_hint(errp,
+  "Unable to restore permissions on block device");
+/* Ignore error since we successfully extended the device already */
+}
+return 0;
+}
+
 void aspeed_board_init_flashes(AspeedSMCState *s, const char *flashtype,
   unsigned int count, int unit0)
 {
@@ -273,10 +297,24 @@ void aspeed_board_init_flashes(AspeedSMCState *s, const char *flashtype,
 DriveInfo *dinfo = drive_get(IF_MTD, 0, unit0 + i);
 qemu_irq cs_line;
 DeviceState *dev;
+AspeedSMCFlash *flash = &s->flashes[i];
+uint64_t flash_size = memory_region_size(&flash->mmio);
 
 dev = qdev_new(flashtype);
 if (dinfo) {
-qdev_prop_set_drive(dev, "drive", blk_by_legacy_dinfo(dinfo));
+BlockBackend *blk = blk_by_legacy_dinfo(dinfo);
+int64_t blk_size = blk_getlength(blk);
+
+if (blk_size > 0 && blk_size < (int64_t)flash_size) {
+Error *err = NULL;
+
+zero_extend_block_device(blk, flash_size, &err);
+if (err) {
+warn_reportf_err(err, "Error zero-extending MTD drive[%d] "
+ "to flash size", i);
+}
+}
+qdev_prop_set_drive(dev, "drive", blk);
 }
 qdev_realize_and_unref(dev, BUS(s->spi), _fatal);
 
-- 
2.38.1
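
For comparison, the manual workaround this automates is a plain
zero-extension of the image before boot, e.g. "truncate -s 128M flash.img"
(file name made up here); the helper above does the equivalent blk_truncate()
with BDRV_REQ_ZERO_WRITE from inside QEMU, after temporarily widening the
block-backend permissions.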




Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-14 Thread Alex Bennée


Christian Borntraeger  writes:

> Am 08.11.22 um 10:23 schrieb Alex Bennée:
>> The previous fix to virtio_device_started revealed a problem in its
>> use by both the core and the device code. The core code should be able
>> to handle the device "starting" while the VM isn't running to handle
>> the restoration of migration state. To solve this dual use introduce a
>> new helper for use by the vhost-user backends who all use it to feed a
>> should_start variable.
>> We can also pick up a change vhost_user_blk_set_status while we are
>> at
>> it which follows the same pattern.
>> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to
>> virtio_device_started)
>> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
>> Signed-off-by: Alex Bennée 
>> Cc: "Michael S. Tsirkin" 
>
> Hmmm, is this
> commit 259d69c00b67c02a67f3bdbeeea71c2c0af76c35
> Author: Alex Bennée 
> AuthorDate: Mon Nov 7 12:14:07 2022 +
> Commit: Michael S. Tsirkin 
> CommitDate: Mon Nov 7 14:08:18 2022 -0500
>
> hw/virtio: introduce virtio_device_should_start
>
> an older version?

Only missing the additional Fixes line MST suggested in the review. I
should have made it clearer in the --- comment.

Which test is failing?

-- 
Alex Bennée



Re: [PATCH v2 12/15] migration: Move last_sent_block into PageSearchStatus

2022-11-14 Thread Juan Quintela
Peter Xu  wrote:
> Since we use PageSearchStatus to represent a channel, it makes perfect
> sense to keep last_sent_block (aka, leverage RAM_SAVE_FLAG_CONTINUE) to be
> per-channel rather than global because each channel can be sending
> different pages on ramblocks.
>
> Hence move it from RAMState into PageSearchStatus.
>
> Reviewed-by: Dr. David Alan Gilbert 
> Signed-off-by: Peter Xu 

Reviewed-by: Juan Quintela 




Re: [PATCH v2 02/12] tests/avocado: improve behaviour waiting for login prompts

2022-11-14 Thread Philippe Mathieu-Daudé

On 14/11/22 17:28, Peter Maydell wrote:

On Fri, 11 Nov 2022 at 14:58, Alex Bennée  wrote:


This attempts to deal with the problem of login prompts not being
guaranteed to be terminated with a newline. The solution to this is to
peek at the incoming data to see if there is an upcoming match
before we fall back to the old readline() logic. The reason to mostly
rely on readline is because I am occasionally seeing the peek stalling
despite data being there.

This seems kinda hacky and gross so I'm open to alternative approaches
and cleaner python code.

Signed-off-by: Alex Bennée 
Reviewed-by: Philippe Mathieu-Daudé 


With this patch, the evb_sdk test fails:

Fetching asset from
./build/arm-clang/tests/avocado/machine_aspeed.py:AST2x00MachineSDK.test_arm_ast2500_evb_sdk
JOB ID : 542e050c4f7e1ddd6d5cdd350e4c26e1bdfcdee4
JOB LOG: 
/home/petmay01/avocado/job-results/job-2022-11-14T16.21-542e050/job.log
  (1/1) 
./build/arm-clang/tests/avocado/machine_aspeed.py:AST2x00MachineSDK.test_arm_ast2500_evb_sdk:
ERROR: log() missing 1 required positional argument: 'msg' (82.57 s)
RESULTS: PASS 0 | ERROR 1 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0
| CANCEL 0
JOB TIME   : 84.09 s

The avocado log reports a traceback where Python has thrown a
UnicodeDecodeError, and then subsequently an attempted debug
message in the error-handling path has a syntax error
("log() missing 1 required positional argument"):



_console_interaction(test, success_message, failure_message, None,
vm=vm)
2022-11-14 16:22:48,573 stacktrace   L0045 ERROR|   File
"/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/avocad
o/avocado_qemu/__init__.py", line 226, in _console_interaction
2022-11-14 16:22:48,573 stacktrace   L0045 ERROR| msg =
_peek_ahead(console, min_match, success_message)
2022-11-14 16:22:48,573 stacktrace   L0045 ERROR|   File
"/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/tests/avocad
o/avocado_qemu/__init__.py", line 180, in _peek_ahead
2022-11-14 16:22:48,573 stacktrace   L0045 ERROR|
console_logger.log("error in decode of peek")
2022-11-14 16:22:48,573 stacktrace   L0045 ERROR| TypeError: log()
missing 1 required positional argument: 'msg'


Indeed, log() expects a Level as first argument. Here we simply want to
report the exception as a warning and continue:

-- >8 --
 except UnicodeDecodeError:
-console_logger.log("error in decode of peek")
+console_logger.warning("error in decode of peek")
 return None
---



Re: [PATCH v2 3/5] migration: Disallow postcopy preempt to be used with compress

2022-11-14 Thread Juan Quintela
Peter Xu  wrote:
> The preempt mode requires the capability to assign channel for each of the
> page, while the compression logic will currently assign pages to different
> compress thread/local-channel so potentially they're incompatible.
>
> Reviewed-by: Dr. David Alan Gilbert 
> Signed-off-by: Peter Xu 

Reviewed-by: Juan Quintela 




Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-11-14 Thread Vlastimil Babka
On 11/1/22 16:19, Michael Roth wrote:
> On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
>> > 
>> >   1) restoring kernel directmap:
>> > 
>> >  Currently SNP (and I believe TDX) need to either split or remove 
>> > kernel
>> >  direct mappings for restricted PFNs, since there is no guarantee that
>> >  other PFNs within a 2MB range won't be used for non-restricted
>> >  (which will cause an RMP #PF in the case of SNP since the 2MB
>> >  mapping overlaps with guest-owned pages)
>> 
>> Has the splitting and restoring been a well-discussed direction? I'm
>> just curious whether there is other options to solve this issue.
> 
> For SNP it's been discussed for quite some time, and either splitting or
> removing private entries from directmap are the well-discussed way I'm
> aware of to avoid RMP violations due to some other kernel process using
> a 2MB mapping to access shared memory if there are private pages that
> happen to be within that range.
> 
> In both cases the issue of how to restore directmap as 2M becomes a
> problem.
> 
> I was also under the impression TDX had similar requirements. If so,
> do you know what the plan is for handling this for TDX?
> 
> There are also 2 potential alternatives I'm aware of, but these haven't
> been discussed in much detail AFAIK:
> 
> a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
>request 2MB THP pages, but I'm not sure how reliably we can guarantee
>that enough THPs are available, so if we went that route we'd probably
>be better off requiring the use of hugetlbfs as the backing store. But
>obviously that's a bit limiting and it would be nice to have the option
>of using normal pages as well. One nice thing with invalidation
>scheme proposed here is that this would "Just Work" if implement
>hugetlbfs support, so an admin that doesn't want any directmap
>splitting has this option available, otherwise it's done as a
>best-effort.
> 
> b) Implement general support for restoring directmap as 2M even when
>subpages might be in use by other kernel threads. This would be the
>most flexible approach since it requires no special handling during
>invalidations, but I think it's only possible if all the CPA
>attributes for the 2M range are the same at the time the mapping is
>restored/unsplit, so some potential locking issues there and still
>chance for splitting directmap over time.

I've been hoping that

c) using a mechanism such as [1] [2] where the goal is to group together
these small allocations that need to increase directmap granularity so
maximum number of large mappings are preserved. But I guess that means
knowing at allocation time that this will happen. So I've been wondering how
this would be possible to employ in the SNP/UPM case? I guess it depends on
how we expect the private/shared conversions to happen in practice, and I
don't know the details. I can imagine the following complications:

- a memfd_restricted region is created such that it's 2MB large/aligned,
i.e. like case a) above, we can allocate it normally. Now, what if a 4k page
in the middle is to be temporarily converted to shared for some
communication between host and guest (can such a thing happen?). With the
punch hole approach, I wonder if we end up fragmenting directmap
unnecessarily? IIUC the now shared page will become backed by some other
page (as the memslot supports both private and shared pages simultaneously).
But does it make sense to really split the direct mapping (and e.g. the
shmem page?) We could leave the whole 2MB unmapped without splitting if we
didn't free the private 4k subpage.

- a restricted region is created that's below 2MB. If something like [1] is
merged, it could be used for the backing pages to limit directmap
fragmentation. But then in case it's eventually fallocated to become larger
and gain one or more 2MB-aligned ranges, the result is suboptimal. Unless
in that case we migrate the existing pages to a THP-backed shmem, kinda like
khugepaged collapses hugepages. But that would have to be coordinated with
the guest, maybe not even possible?

[1] https://lore.kernel.org/all/20220127085608.306306-1-r...@kernel.org/
[2] https://lwn.net/Articles/894557/

>> 
>> > 
>> >  Previously we were able to restore 2MB mappings to some degree
>> >  since both shared/restricted pages were all pinned, so anything
>> >  backed by a THP (or hugetlb page once that is implemented) at guest
>> >  teardown could be restored as 2MB direct mapping.
>> > 
>> >  Invalidation seems like the most logical time to have this happen,
>> 
>> Currently invalidation only happens at user-initiated fallocate(). It
>> does not cover the VM teardown case where the restoring might also be
>> expected to be handled.
> 
> Right, I forgot to add that in my proposed changes I added invalidations
> for any still-allocated private pages present when the restricted memfd
> 

[PATCH v2 2/3] nvme: Add ocp to the subsys

2022-11-14 Thread Joel Granados
The Open Compute Project defines a Datacenter NVMe SSD Spec that sits on
top of the NVMe spec and adds commands and NVMe behaviors specific to the
datacenter. This is a preparation patch that introduces an argument to
activate OCP in nvme.

Signed-off-by: Joel Granados 
---
 hw/nvme/nvme.h   | 1 +
 hw/nvme/subsys.c | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 79f5c281c2..aa99c0c57c 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -56,6 +56,7 @@ typedef struct NvmeSubsystem {
 
 struct {
 char *nqn;
+bool ocp;
 } params;
 } NvmeSubsystem;
 
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 9d2643678b..ecca28449c 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -129,8 +129,8 @@ static void nvme_subsys_realize(DeviceState *dev, Error 
**errp)
 
 static Property nvme_subsystem_props[] = {
 DEFINE_PROP_STRING("nqn", NvmeSubsystem, params.nqn),
-DEFINE_PROP_END_OF_LIST(),
-};
+DEFINE_PROP_BOOL("ocp", NvmeSubsystem, params.ocp, false),
+DEFINE_PROP_END_OF_LIST(), };
 
 static void nvme_subsys_class_init(ObjectClass *oc, void *data)
 {
-- 
2.30.2
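
With the property in place, enabling OCP behavior is a matter of attaching
the controller to a subsystem created with ocp=true. A plausible invocation
(ids and serial are made up; only the new ocp flag comes from this patch):

  -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,ocp=true \
  -device nvme,serial=deadbeef,subsys=nvme-subsys-0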




[PATCH] hw/intc/arm_gicv3: fix prio masking on pmr write

2022-11-14 Thread Jens Wiklander
With commit 39f29e599355 ("hw/intc/arm_gicv3: Use correct number of
priority bits for the CPU") the number of priority bits was changed from
the maximum value 8 to typically 5. As a consequence a few of the lowest
bits in ICC_PMR_EL1 become RAZ/WI. However, prior to this patch one of
these bits was still used, since the supplied priority value was masked
before it was eventually right-shifted by one bit. So the bit was not
lost when the register is read again, as one might otherwise expect.

The Linux kernel depends on the lowest valid bit being reset to zero; see
commit 33625282adaa ("irqchip/gic-v3: Probe for SCR_EL3 being clear
before resetting AP0Rn") for details.

So fix this by masking the priority value after it may have been
right-shifted by one bit.

Fixes: 39f29e599355 ("hw/intc/arm_gicv3: Use correct number of priority bits 
for the CPU")
Signed-off-by: Jens Wiklander 
---
Hi,

I've only tested this patch on top of v7.1.0 since I couldn't get current
to run in my test setup.

In case anyone wonders what I'm testing, it's a setup with Hafnium at
S-EL2, OP-TEE at S-EL1 and the Linux kernel at NS-EL1 (no NS-EL2 for
simplicity).

Regards,
Jens

 hw/intc/arm_gicv3_cpuif.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/intc/arm_gicv3_cpuif.c b/hw/intc/arm_gicv3_cpuif.c
index 8ca630e5ad1e..b17b29288c73 100644
--- a/hw/intc/arm_gicv3_cpuif.c
+++ b/hw/intc/arm_gicv3_cpuif.c
@@ -1016,8 +1016,6 @@ static void icc_pmr_write(CPUARMState *env, const 
ARMCPRegInfo *ri,
 
 trace_gicv3_icc_pmr_write(gicv3_redist_affid(cs), value);
 
-value &= icc_fullprio_mask(cs);
-
 if (arm_feature(env, ARM_FEATURE_EL3) && !arm_is_secure(env) &&
 (env->cp15.scr_el3 & SCR_FIQ)) {
 /* NS access and Group 0 is inaccessible to NS: return the
@@ -1029,6 +1027,7 @@ static void icc_pmr_write(CPUARMState *env, const 
ARMCPRegInfo *ri,
 }
 value = (value >> 1) | 0x80;
 }
+value &= icc_fullprio_mask(cs);
 cs->icc_pmr_el1 = value;
 gicv3_cpuif_update(cs);
 }
-- 
2.31.1
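
A concrete illustration (my own numbers, assuming 5 implemented priority
bits so that icc_fullprio_mask() is 0xf8): an NS write of 0x08 with
SCR_EL3.FIQ set used to store ((0x08 & 0xf8) >> 1) | 0x80 = 0x84, leaving
RAZ/WI bit 2 set in ICC_PMR_EL1; masking after the shift instead stores
((0x08 >> 1) | 0x80) & 0xf8 = 0x80, so the unimplemented low bits read back
as zero, which is what the kernel's SCR_EL3 probe relies on.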




Re: [PULL 0/2] target-arm queue for rc1

2022-11-14 Thread Stefan Hajnoczi
On Mon, Nov 14, 2022 at 03:51:59PM +, Peter Maydell wrote:
> Hi; here's the arm pullreq for rc1. Just one bugfix and
> a MAINTAINERS file update...
> 
> thanks
> -- PMM
> 
> The following changes since commit 305f6f62d9d250a32cdf090ddcb7e3a5b26a342e:
> 
>   Merge tag 'pull-la-20221112' of https://gitlab.com/rth7680/qemu into 
> staging (2022-11-12 09:17:06 -0500)
> 
> are available in the Git repository at:
> 
>   https://git.linaro.org/people/pmaydell/qemu-arm.git 
> tags/pull-target-arm-20221114
> 
> for you to fetch changes up to d9721f19cd05a382f4f5a7093c80d1c4a8a1aa82:
> 
>   hw/intc/arm_gicv3: fix prio masking on pmr write (2022-11-14 15:10:58 +)

Applied, thanks!

Stefan

> 
> 
> target-arm queue:
>  * hw/intc/arm_gicv3: fix prio masking on pmr write
>  * MAINTAINERS: Update maintainer's email for Xilinx CAN
> 
> 
> Jens Wiklander (1):
>   hw/intc/arm_gicv3: fix prio masking on pmr write
> 
> Vikram Garhwal (1):
>   MAINTAINERS: Update maintainer's email for Xilinx CAN
> 
>  hw/intc/arm_gicv3_cpuif.c | 3 +--
>  MAINTAINERS   | 4 ++--
>  2 files changed, 3 insertions(+), 4 deletions(-)
> 




Re: biosbits test failing on origin/master

2022-11-14 Thread John Snow
On Thu, Nov 10, 2022 at 11:22 PM Ani Sinha  wrote:
>
> On Thu, Nov 10, 2022 at 11:37 PM John Snow  wrote:
> >
> > Hiya, on today's origin/master
> > (2ccad61746ca7de5dd3e25146062264387e43bd4) I'm finding that "make
> > check-avocado" is failing on the new biosbits test on my local
> > development machine:
> >
> >  (001/193) tests/avocado/acpi-bits.py:AcpiBitsTest.test_acpi_smbios_bits:
> > FAIL: True is not false : The VM seems to have failed to shutdown in
> > time (83.65 s)
> >
> > Is this a known issue, or should I begin to investigate it?
>
> In my test environment it does pass.
>
> $ ./tests/venv/bin/avocado run -t acpi tests/avocado
> Fetching asset from
> tests/avocado/acpi-bits.py:AcpiBitsTest.test_acpi_smbios_bits
> JOB ID : 35726df7d3c2e0f41847822620c78195ba45b9b9
> JOB LOG: 
> /home/anisinha/avocado/job-results/job-2022-11-11T09.42-35726df/job.log
>  (1/1) tests/avocado/acpi-bits.py:AcpiBitsTest.test_acpi_smbios_bits:
> PASS (57.57 s)
> RESULTS: PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0
> | CANCEL 0
> JOB TIME   : 63.82 s
>
> However, I have seen that on certain slower test machines or when run
> within a virtual machine, the test can take longer to complete and 60
> secs may not always be enough. In those cases raising the maximum
> completion time to 90 secs helps. Perhaps you can try this and let me
> know if it helps:

Hmm - I'm running on a fairly modern machine and not in a VM. Do you
have an invocation to share that exists outside of the avocado
machinery where I could test this individually and see how long it
might take to complete if I just let it run? I am worried that it's
getting wedged instead of just taking a long time, but it's hard to
tell.

--js

>
> diff --git a/tests/avocado/acpi-bits.py b/tests/avocado/acpi-bits.py
> index 8745a58a76..b11fe39350 100644
> --- a/tests/avocado/acpi-bits.py
> +++ b/tests/avocado/acpi-bits.py
> @@ -385,8 +385,9 @@ def test_acpi_smbios_bits(self):
>          self._vm.launch()
>          # biosbits has been configured to run all the specified test suites
>          # in batch mode and then automatically initiate a vm shutdown.
> -        # sleep for maximum of one minute
> -        max_sleep_time = time.monotonic() + 60
> +        # sleep for a maximum of one and half minutes to accommodate running this
> +        # even on slower machines.
> +        max_sleep_time = time.monotonic() + 90
>          while self._vm.is_running() and time.monotonic() < max_sleep_time:
>              time.sleep(1)
>




[PATCH] target/mips: Properly set C0_CMGCRBase after CPU reset

2022-11-14 Thread Jiaxun Yang
The value of C0_CMGCRBase is reset to its default whenever a CPU reset
happens. In some cases software may move the GCR base and then initiate
a CPU reset, which leaves C0_CMGCRBase of the reset core incorrect.

Implement a callback in the CMGCR device to allow C0_CMGCRBase and other
global state to be overridden after CPU reset.

Signed-off-by: Jiaxun Yang 
---
This fixes SMP boot for the Boston board.
I'm not sure if this is the best place to hook such a callback,
but we can add more global state such as BEV here in the future.
---
 hw/mips/cps.c| 3 ++-
 hw/misc/mips_cmgcr.c | 5 +
 target/mips/cpu.c| 4 +++-
 target/mips/cpu.h| 4 
 4 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/hw/mips/cps.c b/hw/mips/cps.c
index 2b436700ce..29b10ff8d0 100644
--- a/hw/mips/cps.c
+++ b/hw/mips/cps.c
@@ -98,6 +98,7 @@ static void mips_cps_realize(DeviceState *dev, Error **errp)
         cpu_mips_clock_init(cpu);
 
         env = &cpu->env;
+        env->gcr = &s->gcr;
         if (cpu_mips_itu_supported(env)) {
             itu_present = true;
             /* Attach ITC Tag to the VP */
@@ -158,7 +159,7 @@ static void mips_cps_realize(DeviceState *dev, Error **errp)
                             sysbus_mmio_get_region(SYS_BUS_DEVICE(&s->gic), 0));
 
     /* Global Configuration Registers */
-    gcr_base = env->CP0_CMGCRBase << 4;
+    gcr_base = GCR_BASE_ADDR;
 
     object_initialize_child(OBJECT(dev), "gcr", &s->gcr, TYPE_MIPS_GCR);
     object_property_set_int(OBJECT(&s->gcr), "num-vp", s->num_vp,
diff --git a/hw/misc/mips_cmgcr.c b/hw/misc/mips_cmgcr.c
index 3c8b37f700..f2108b7d32 100644
--- a/hw/misc/mips_cmgcr.c
+++ b/hw/misc/mips_cmgcr.c
@@ -19,6 +19,11 @@
 #include "hw/qdev-properties.h"
 #include "hw/intc/mips_gic.h"
 
+void gcr_cpu_reset(struct MIPSGCRState *s, CPUMIPSState *env)
+{
+    env->CP0_CMGCRBase = s->gcr_base >> 4;
+}
+
 static inline bool is_cpc_connected(MIPSGCRState *s)
 {
     return s->cpc_mr != NULL;
diff --git a/target/mips/cpu.c b/target/mips/cpu.c
index e997c1b9cb..d0a76b95f7 100644
--- a/target/mips/cpu.c
+++ b/target/mips/cpu.c
@@ -297,7 +297,9 @@ static void mips_cpu_reset(DeviceState *dev)
         env->CP0_EBase |= (int32_t)0x80000000;
     }
     if (env->CP0_Config3 & (1 << CP0C3_CMGCR)) {
-        env->CP0_CMGCRBase = 0x1fbf8000 >> 4;
+        if (env->gcr) {
+            gcr_cpu_reset(env->gcr, env);
+        }
     }
     env->CP0_EntryHi_ASID_mask = (env->CP0_Config5 & (1 << CP0C5_MI)) ?
                         0x0 : (env->CP0_Config4 & (1 << CP0C4_AE)) ? 0x3ff : 0xff;
diff --git a/target/mips/cpu.h b/target/mips/cpu.h
index 0a085643a3..c345e6b1c7 100644
--- a/target/mips/cpu.h
+++ b/target/mips/cpu.h
@@ -1154,6 +1154,7 @@ typedef struct CPUArchState {
     CPUMIPSTLBContext *tlb;
     void *irq[8];
     struct MIPSITUState *itu;
+    struct MIPSGCRState *gcr;
     MemoryRegion *itc_tag; /* ITC Configuration Tags */
 #endif
 
@@ -1310,6 +1311,9 @@ void cpu_mips_soft_irq(CPUMIPSState *env, int irq, int level);
 /* mips_itu.c */
 void itc_reconfigure(struct MIPSITUState *tag);
 
+/* mips_cmgcr.c */
+void gcr_cpu_reset(struct MIPSGCRState *s, CPUMIPSState *env);
+
 #endif /* !CONFIG_USER_ONLY */
 
 /* helper.c */
-- 
2.37.4
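
To make the effect concrete, here is a small illustrative C sketch; the relocated address below is hypothetical, not taken from the patch:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t gcr_base = 0x1fbf8000;   /* power-on default GCR base */

    gcr_base = 0x16100000;            /* guest software relocates the GCR
                                       * block (hypothetical address) */

    /* Old reset path: always restored the hardcoded default. */
    uint32_t stale = 0x1fbf8000 >> 4;
    /* New reset path, as in gcr_cpu_reset(): follows the live GCR state. */
    uint32_t fresh = (uint32_t)(gcr_base >> 4);

    printf("stale=0x%08x fresh=0x%08x\n", stale, fresh);
    return 0;
}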




Re: [PATCH 04/13] block: Remove drained_end_counter

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

drained_end_counter is unused now, nobody changes its value any more. It
can be removed.

In cases where we had two almost identical functions that only differed
in whether the caller passes drained_end_counter, or whether they would
poll for a local drained_end_counter to reach 0, these become a single
function.

Signed-off-by: Kevin Wolf 
---
  include/block/block-io.h | 15 -
  include/block/block_int-common.h |  6 +-
  block.c  |  5 +-
  block/block-backend.c|  4 +-
  block/io.c   | 97 
  blockjob.c   |  2 +-
  6 files changed, 30 insertions(+), 99 deletions(-)


The comments on bdrv_drained_end() and bdrv_parent_drained_end_single() 
in include/block/block-io.h still say that they poll some AioContext 
“which may result in a graph change”.  That’s no longer the case, 
though, so those paragraphs should be dropped, I think.


Apart from that, looks good.

Hanna




Re: [PATCH v2 2/8] target/riscv: add support for Zca and Zcf extensions

2022-11-14 Thread weiwei



On 2022/11/14 05:40, Richard Henderson wrote:

On 11/13/22 12:32, Weiwei Li wrote:

+    } else if ((get_xl_max(ctx) == MXL_RV32) &&
+               !ctx->cfg_ptr->ext_zcf &&
+               (((opcode & 0xe003) == 0x6000) ||
+                ((opcode & 0xe003) == 0x6002) ||
+                ((opcode & 0xe003) == 0xe000) ||
+                ((opcode & 0xe003) == 0xe002))) {
         gen_exception_illegal(ctx);


Why aren't you using the same c_flw solution that you do for Zcd?

Yeah, it's OK for the Zcf instructions to use the c_flw solution.

I tried to retain the original logic for the Zcf and Zcd instructions.
However, that approach is not suitable for the Zcd instructions,
since the zcmp/zcmt instructions overlap their encodings (without
being identical). So I changed the approach for the Zcd instructions.


Regards,

Weiwei Li




r~





Re: [PATCH v2 07/15] migration: Use atomic ops properly for page accountings

2022-11-14 Thread Juan Quintela
Peter Xu  wrote:
> To prepare for thread-safety on page accountings, at least below counters
> need to be accessed only atomically, they are:
>
> ram_counters.transferred
> ram_counters.duplicate
> ram_counters.normal
> ram_counters.postcopy_bytes
>
> There are a lot of other counters but they won't be accessed outside
> migration thread, then they're still safe to be accessed without atomic
> ops.
>
> Reviewed-by: Dr. David Alan Gilbert 
> Signed-off-by: Peter Xu 

Reviewed-by: Juan Quintela 

NICE!!
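
As a rough sketch of the reader/writer split this enables (the counter and function names below are illustrative, not the patch's; QEMU's stat64 helpers are one way to get atomic accounting that also works on 32-bit hosts):

#include "qemu/stats64.h"

static Stat64 transferred;               /* illustrative counter */

/* Writer side: the migration thread accounting sent bytes. */
static void account_transferred(uint64_t bytes)
{
    stat64_add(&transferred, bytes);
}

/* Reader side: another thread, e.g. serving query-migrate. */
static uint64_t query_transferred(void)
{
    return stat64_get(&transferred);
}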




Re: [PATCH 12/13] block: Don't poll in bdrv_replace_child_noperm()

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

In order to make sure that bdrv_replace_child_noperm() doesn't have to
poll any more, get rid of the bdrv_parent_drained_begin_single() call.

This is possible now because we can require that the child is already
drained when the function is called (it better be, having in-flight
requests while modifying the graph isn't going to end well!) and we
don't call the parent drain callbacks more than once.

The additional drain calls needed in callers cause the test case to run
its code in the drain handler too early (bdrv_attach_child() drains
now), so modify it to only enable the code after the test setup has
completed.

Signed-off-by: Kevin Wolf 
---
  include/block/block-io.h |  8 
  block.c  | 72 +---
  block/io.c   |  2 +-
  tests/unit/test-bdrv-drain.c | 10 +
  4 files changed, 70 insertions(+), 22 deletions(-)


I find this change complicated.  I understand it’s the point of the 
series, but I find it difficult to grasp.  But I guess there can be no 
drain series without such a patch.


As usual, I was very skeptical of the code at first, and over time 
slowly realized that I’m mostly confused by the comments, and the code 
seems fine.  Ah, well.


[...]


diff --git a/block.c b/block.c
index 5f5f79cd16..12039e9b8a 100644
--- a/block.c
+++ b/block.c


[...]


@@ -2414,12 +2428,20 @@ static TransactionActionDrv bdrv_replace_child_drv = {
   *
   * Note: real unref of old_bs is done only on commit.
   *
+ * Both child and new_bs (if non-NULL) must be drained. new_bs must be kept
+ * drained until the transaction is completed (this automatically implies that
+ * child remains drained, too).


I find “child” extremely ambiguous.  The problem is that there generally 
is no way to drain a BdrvChild object, is there?  You can only drain the 
BDS in it, which then drains the parent through the BdrvChild object.  
Historically, I don’t think there was ever a place where we cared about 
the BdrvChild object between the two to be drained, was there?  I mean, 
now there apparently is, in bdrv_child_attach_common(), but that’s a 
different story.


So the problem is that “draining a BdrvChild object” generally appears 
in the context of bdrv_parent_drained_*() functions, i.e. actually 
functions draining the parent.  Which makes it a bit confusing to refer 
to a BdrvChild object just as “child”.


I know that “child” here refers to the variable (or does it not?), but 
that is why I really prefer marking variables that are just plain 
English words, e.g. as @child or `child`, so it’s clear they are a name 
and not a noun.


In any case, because the concept is generally to drain the `child->bs` 
instead of the BdrvChild object directly, I understand the comment to 
mean: “Both the old child (`child->bs`) and `new_bs` (if non-NULL) must 
be drained.  `new_bs` must be kept drained until the transaction is 
completed.  This implies that the parent too will be kept drained until 
the transaction is completed by the BdrvChild object `child`.”


Or am I misunderstanding something, and the distinction between `child` 
and `child->bs` and the parent node is important here? (Would be good to 
know. :))



+ *
   * The function doesn't update permissions, caller is responsible for this.
   */
   static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs,
                                       Transaction *tran)
  {
     BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1);
+
+    assert(child->parent_quiesce_counter);
+    assert(!new_bs || new_bs->quiesce_counter);
+
     *s = (BdrvReplaceChildState) {
         .child = child,
         .old_bs = child->bs,
@@ -2818,6 +2840,12 @@ static void bdrv_replace_child_noperm(BdrvChild *child,


This function now has its callers fulfill kind of a complicated 
contract.  I would prefer that to be written out in a doc comment, 
especially because it sounds like the assertions can’t cover everything 
(i.e. callers that remove a child are required to have stopped issuing 
requests to that child, but they are free to do that in any way they 
want, so no assertion will check for it here).



  int new_bs_quiesce_counter;
  
  assert(!child->frozen);

+    /*
+     * When removing the child, it's the caller's responsibility to make sure
+     * that no requests are in flight any more. Usually the parent is drained,
+     * but not through child->parent_quiesce_counter.
+     */


When I see a comment above an assertion, I immediately assume it is 
going to describe what the assertion checks.  Unless I’m 
misunderstanding something (again?), this comment actually describes 
what the assertion *does not* check.  I find that confusing, especially 
because the comment leads with “it’s the caller’s responsibility”, which 
to me implies “and that’s why we check it here in this assertion”, 
because assertions are there to verify that contracts are met.

Re: [PATCH v2] capstone: use <capstone/capstone.h> instead of <capstone.h>

2022-11-14 Thread Daniel P . Berrangé
On Mon, Nov 14, 2022 at 11:59:31AM +, Peter Maydell wrote:
> On Sun, 13 Nov 2022 at 20:10, Michael Tokarev  wrote:
> >
> > The upcoming capstone 5.0 drops support for the old way
> > of including its header, due to this change:
> > https://github.com/capstone-engine/capstone/commit/6656bcb63ab4e87dc6079bd6b6b12cc8dd9b2ad8
> > The official way is to use <capstone/capstone.h>
> >
> > This change has already been proposed before, see
> > https://patchwork.kernel.org/project/qemu-devel/patch/20180215173539.11033-1-f4...@amsat.org/
> > but it didn't find its way into qemu at that time.
> >
> > On current systems, using <capstone/capstone.h> works
> > now (despite the pkg-config-supplied -I/usr/include/capstone) -
> > since on all systems capstone headers are put into capstone/
> > subdirectory of a system include dir. So this change is
> > compatible with both the obsolete way of including it
> > and the only future way.
> 
> That's only true if capstone happened to be installed
> into a system include directory subdirectory. That
> is probably true for most distros, but it isn't
> necessarily true when an end user has built and
> installed capstone locally themselves.
> 
> In other words, this is a breaking non-back-compatible
> change by capstone upstream, which we now need to work
> around somehow :-(
> 
> 
> > diff --git a/include/disas/capstone.h b/include/disas/capstone.h
> > index e29068dd97..d8fdc5d537 100644
> > --- a/include/disas/capstone.h
> > +++ b/include/disas/capstone.h
> > @@ -3,7 +3,7 @@
> >
> >  #ifdef CONFIG_CAPSTONE
> >
> > -#include <capstone.h>
> >
> > +#include <capstone/capstone.h>
> >  #else
> >
> > diff --git a/meson.build b/meson.build
> > index cf3e517e56..6f34c963f7 100644
> > --- a/meson.build
> > +++ b/meson.build
> > @@ -2680,12 +2680,7 @@ if not get_option('capstone').auto() or have_system or have_user
> >capstone = dependency('capstone', version: '>=3.0.5',
> >  kwargs: static_kwargs, method: 'pkg-config',
> >  required: get_option('capstone'))
> > -
> > -  # Some versions of capstone have broken pkg-config file
> > -  # that reports a wrong -I path, causing the #include to
> > -  # fail later. If the system has such a broken version
> > -  # do not use it.
> > -  if capstone.found() and not cc.compiles('#include <capstone.h>',
> > +  if capstone.found() and not cc.compiles('#include <capstone/capstone.h>',
> >                                            dependencies: [capstone])
> >  capstone = not_found
> >  if get_option('capstone').enabled()
> 
> We can do something like
> 
> config_host_data.set('HAVE_CAPSTONE_CAPSTONE_H',
>   cc.has_header('capstone/capstone.h', dependencies: [capstone]))
> 
> to check that this capstone really does have capstone/capstone.h,
> for instance.
> 
> Dan: is there a reason why in commit 8f4aea712ffc4 you wrote
> the "check that capstone.h really exists" check with cc.compiles
> rather than cc.has_header ?

I was probably just unaware of 'has_header' existing

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
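
A minimal sketch of the resulting include shim, assuming the build system defines HAVE_CAPSTONE_CAPSTONE_H when the has_header() probe above succeeds (the macro name follows the suggestion in this thread; the final patch may differ):

/* include/disas/capstone.h, hypothetical compatibility shim */
#ifdef HAVE_CAPSTONE_CAPSTONE_H
#include <capstone/capstone.h>   /* capstone >= 5.0 official path */
#else
#include <capstone.h>            /* older pkg-config -I layout */
#endif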




Re: [PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin()

2022-11-14 Thread Hanna Reitz

On 08.11.22 13:37, Kevin Wolf wrote:

We want to change .bdrv_co_drained_begin() back to be a non-coroutine
callback, so in preparation, avoid yielding in its implementation.

Because we increase bs->in_flight and bdrv_drained_begin() polls, the
behaviour is unchanged.

Signed-off-by: Kevin Wolf 
---
  block/qed.c | 20 +---
  1 file changed, 17 insertions(+), 3 deletions(-)


Reviewed-by: Hanna Reitz 




Re: [PATCH v2 01/15] migration: Take bitmap mutex when completing ram migration

2022-11-14 Thread Juan Quintela
Peter Xu  wrote:
> Any call to ram_find_and_save_block() needs to take the bitmap mutex.  We
> used to not take it for most of ram_save_complete() because we thought
> we were the only ones left using the bitmap, but that's not true after the
> preempt full patchset was applied, since the return path can take it too.
>
> Signed-off-by: Peter Xu 

Reviewed-by: Juan Quintela 




[PATCH v2] tests/stream-under-throttle: New test

2022-11-14 Thread Hanna Reitz
Test streaming a base image into the top image underneath two throttle
nodes.  This was reported to make qemu 7.1 hang
(https://gitlab.com/qemu-project/qemu/-/issues/1215), so this serves as
a regression test.

Signed-off-by: Hanna Reitz 
---
Based-on: <20221107151321.211175-1-hre...@redhat.com>

v1: https://lists.nongnu.org/archive/html/qemu-block/2022-11/msg00368.html

v2:
- Replace `asyncio.exceptions.TimeoutError` by `asyncio.TimeoutError`:
  Stefan reported that the CI does not recognize the former:
  https://lists.nongnu.org/archive/html/qemu-block/2022-11/msg00424.html

  As far as I understand, the latter was basically moved to become the
  former in Python 3.11, and an alias remains, so both are basically
  equivalent.  I only have 3.10, though, where the documentation says
  that both are different, even though using either seems to work fine
  (i.e. both catch the timeout there).  Not sure about previous
  versions, but the CI seems pretty certain about not knowing
  `asyncio.exceptions.TimeoutError`, so use `asyncio.TimeoutError`
  instead.  (Even though that is deprecated in 3.11, but this is not the
  first place in the tree to use it, so it should not be too bad.)
---
 .../qemu-iotests/tests/stream-under-throttle  | 121 ++
 .../tests/stream-under-throttle.out   |   5 +
 2 files changed, 126 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/stream-under-throttle
 create mode 100644 tests/qemu-iotests/tests/stream-under-throttle.out

diff --git a/tests/qemu-iotests/tests/stream-under-throttle b/tests/qemu-iotests/tests/stream-under-throttle
new file mode 100755
index 0000000000..8d2d9e1684
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-under-throttle
@@ -0,0 +1,121 @@
+#!/usr/bin/env python3
+# group: rw
+#
+# Test streaming with throttle nodes on top
+#
+# Copyright (C) 2022 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+import asyncio
+import os
+from typing import List
+import iotests
+from iotests import qemu_img_create, qemu_io
+
+
+image_size = 256 * 1024 * 1024
+base_img = os.path.join(iotests.test_dir, 'base.img')
+top_img = os.path.join(iotests.test_dir, 'top.img')
+
+
+class TcgVM(iotests.VM):
+'''
+Variant of iotests.VM that uses -accel tcg.  Simply using
+iotests.VM.add_args('-accel', 'tcg') is not sufficient, because that will
+put -accel qtest before -accel tcg, and -accel arguments are prioritized in
+the order they appear.
+'''
+@property
+def _base_args(self) -> List[str]:
+# Put -accel tcg first so it takes precedence
+return ['-accel', 'tcg'] + super()._base_args
+
+
+class TestStreamWithThrottle(iotests.QMPTestCase):
+def setUp(self) -> None:
+'''
+Create a simple backing chain between two images, write something to
+the base image.  Attach them to the VM underneath two throttle nodes,
+one of which has actually no limits set, but the other does.  Then put
+a virtio-blk device on top.
+This test configuration has been taken from
+https://gitlab.com/qemu-project/qemu/-/issues/1215
+'''
+qemu_img_create('-f', iotests.imgfmt, base_img, str(image_size))
+qemu_img_create('-f', iotests.imgfmt, '-b', base_img, '-F',
+iotests.imgfmt, top_img, str(image_size))
+
+# Write something to stream
+qemu_io(base_img, '-c', f'write 0 {image_size}')
+
+blockdev = {
+'driver': 'throttle',
+'node-name': 'throttled-node',
+'throttle-group': 'thrgr-limited',
+'file': {
+'driver': 'throttle',
+'throttle-group': 'thrgr-unlimited',
+'file': {
+'driver': iotests.imgfmt,
+'node-name': 'unthrottled-node',
+'file': {
+'driver': 'file',
+'filename': top_img
+}
+}
+}
+}
+
+# Issue 1215 is not reproducible in qtest mode, which is why we need to
+# create an -accel tcg VM
+self.vm = TcgVM()
+self.vm.add_object('iothread,id=iothr0')
+self.vm.add_object('throttle-group,id=thrgr-unlimited')
+self.vm.add_object('throttle-group,id=thrgr-limited,'
+   

Re: [PATCH v2 5/5] migration: Disable multifd explicitly with compression

2022-11-14 Thread Juan Quintela
Peter Xu  wrote:
> The multifd thread model does not work for compression, so explicitly
> disable it.
>
> Note that previously, even though we could enable both of them, nothing
> would go wrong, because the compression code has higher priority, so the
> multifd feature would just be ignored.  Now we fail earlier, at config
> time, so the user is better aware of the consequence.
>
> Note that there is a slight chance of breaking existing users, but let's
> assume they are not the majority and not serious users, or they would
> already have found that multifd is not working.
>
> With that, we can safely drop the check in ram_save_target_page() for
> using multifd, because when multifd=on then compression=off, so the
> removed check on save_page_use_compression() would always return false
> anyway.
>
> Signed-off-by: Peter Xu 

Reviewed-by: Juan Quintela 
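
For illustration, the config-time rejection could look like the sketch below, in the style of QEMU's migration capability checks; the exact placement and error wording are assumptions, not quotes from the patch:

/* Hypothetical check during capability validation. */
static bool check_multifd_compress(const bool *new_caps, Error **errp)
{
    if (new_caps[MIGRATION_CAPABILITY_MULTIFD] &&
        new_caps[MIGRATION_CAPABILITY_COMPRESS]) {
        error_setg(errp, "Multifd is not compatible with compress");
        return false;
    }
    return true;
}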



