On Mon, Nov 08, 2010 at 11:18:34AM +0800, Sheng Yang wrote:
> On Friday 05 November 2010 21:27:42 Michael S. Tsirkin wrote:
> > On Fri, Nov 05, 2010 at 06:53:50PM +0800, Sheng Yang wrote:
> > > On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote:
> > > > On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
> > > > > > > +};
> > > > > > > +
> > > > > > > +This ioctl would enable in-kernel MSI-X emulation, which would
> > > > > > > handle MSI-X +mask bit in the kernel.
> > > > > > 
> > > > > > What happens on repeated calls when it's already enabled?
> > > > > > How does one disable after it's enabled?
> > > > > 
> > > > > Suppose this should only be called once. But again would update the
> > > > > MMIO base.
> > > > 
> > > > So maybe add this all to documentation.
> > > > 
> > > > > It
> > > > > would be disabled along with device deassignment.
> > > > 
> > > > So what are you going to do when guest enables and later disables MSIX?
> > > > Disable assignment completely?
> > > 
> > > This device goes with PCI resources allocation, rather than IRQ
> > > assignment.
> > 
> > I see. I guess we can live with it, but it seems tied to a specific
> > mode of use. Would be better to support enabling together with msix too,
> > this requires an ability to disable.
> > 
> > > > > We enable it explicitly because
> > > > > not all devices have MSI-X capability.
> > > > > 
> > > > > > > +
> > > > > > 
> > > > > > This is very specialized for assigned devices.  I would like an
> > > > > > interface not tied to assigned devices explicitly
> > > > > > (even if the implementation at the moment only works
> > > > > > for assigned devices, I can patch that in later). E.g. vhost-net
> > > > > > would benefit from such an extension too.  Maybe tie this to a
> > > > > > special GSI and this GSI can be an assigned device or not?
> > > > > 
> > > > > You're talking about KVM_ASSIGN_GET_MSIX_ENTRY?
> > > > 
> > > > Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO
> > > > and kvm_assigned_msix_mmio really.
> > > > 
> > > > > We can't tie it to GSI, because we
> > > > > should also cover entries without GSI. The entry number should be
> > > > > fine for all kinds of devices using MSI-X.
> > > > 
> > > > I don't completely understand: entries without GSI
> > > > never need to inject an interrupt. Why do we care
> > > > about them? Let's pass such accesses to userspace.
> > > > 
> > > > In any case, I think MSIX GSIs are inexpensive (we never search them
> > > > all), so we could create GSIs for all entries if we wanted to.
> > > 
> > > All mask bit accessing controlled by kernel, in order to keep consistent.
> > 
> > Well I think passing whatever we can't handle to userspace makes
> > complete sense. We do this for address/data writes after all.
> > If we wanted to do it *all* in kernel we would handle
> > address/data there too. I think Avi suggested this at some point.
> 
> I think Avi's point was we handle mask bit half in the kernel and half in the 
> userspace, which makes API complex. I agree with this so I moved all mask bit 
> handling to kernel.
> 
> Data/address is another issue, due to QEmu own the routing table. We can 
> change it 
> in the future, but this is not related to this patch.
> > 
> > The main point is msix mmio BAR handling is completely unrelated
> > to the assigned device. Emulated devices have an msix BAR as well.
> > So tying it to an assigned devices is suboptimal, we'll need to
> > grow yet another interface for these.
> 
> If you only means the name, we can change that.

I mean it shouldn't use the assigned device id. Use some other index:
I suggested adding a new GSI type for this, but can be something else
if you prefer.

> But I think all MSI-X entries 
> would have entry number, regardless of if it got GSI number. Keep entry 
> number as 
> parameter makes more sense to me.
> 
> And still, the patch's only current user is assigned device. I am glad if it 
> can 
> cover the other future case along with it, but that's not the primary target 
> of 
> this patch.

It's a problem I think. Stop thinking about the patch for a moment
and think about kvm as a whole. We know we will want such an interface for
emulated and PV devices. What then? Add another interface and a bunch of
code to support it?  Point being, since userspace interfaces need to be
supported forever, they should be generic if possible.


> > 
> > > > > I am not sure about what PV one would looks like.
> > > > > I use kvm_assigned_msix_entry
> > > > > just because it can be easily reused.
> > > > 
> > > > Not only PV, emulated devices too. OK let's try to figure out:
> > > > 
> > > > What happens with userspace is, we inject interrupts either through
> > > > ioctls or eventfds. Avoiding an exit to userspace on MSIX
> > > > access is beneficial in exactly the same way.
> > > > 
> > > > We could have an ioctl to pass in the table addresses, and mapping
> > > > table to relevant GSIs. For example, add kvm_msix_mmio with table
> > > > start/end, also specify which GSIs are covered.
> > > > Maybe ask that userspace allocate all GSIs for a table consequitively?
> > > > Or add kvm_irq_routing_msix which is same as msi but
> > > > also has the mmio addresses? Bonus points for avoiding the need
> > > > to scan all table on each write. For example, don't scan at all,
> > > > instead just set a bit in KVM, and set irq entry on the next
> > > > interrupt only: we have KVM_MAX_MSIX_PER_DEV as 256 so maybe a small 32
> > > > byte bitmap would be enough for this, avoding a linear scan.
> > > > 
> > > > When we pass in kvm_msix_mmio, kvm would start registering masked
> > > > state.
> > > > 
> > > > > > > +4.48 KVM_ASSIGN_GET_MSIX_ENTRY
> > > > > > > +
> > > > > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > > > > > +Architectures: x86
> > > > > > > +Type: vm ioctl
> > > > > > > +Parameters: struct kvm_assigned_msix_entry (in and out)
> > > > > > > +Returns: 0 on success, !0 on error
> > > > > > > +
> > > > > > > +struct kvm_assigned_msix_entry {
> > > > > > > + /* Assigned device's ID */
> > > > > > > + __u32 assigned_dev_id;
> > > > > > > + /* Ignored */
> > > > > > > + __u32 gsi;
> > > > > > > + /* The index of entry in the MSI-X table */
> > > > > > > + __u16 entry;
> > > > > > > + /* Querying flags and returning status */
> > > > > > > + __u16 flags;
> > > > > > > + /* Must be 0 */
> > > > > > > + __u16 padding[2];
> > > > > > > +};
> > > > > > > +
> > > > > > > +This ioctl would allow userspace to get the status of one
> > > > > > > specific MSI-X +entry. Currently we support mask bit status
> > > > > > > querying. +
> > > > > > > 
> > > > > > >  5. The kvm_run structure
> > > > > > >  
> > > > > > >  Application code obtains a pointer to the kvm_run structure by
> > > > > > > 
> > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > > index f3f86b2..8fd5121 100644
> > > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > > > > 
> > > > > > >   case KVM_CAP_DEBUGREGS:
> > > > > > >   case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > > > > 
> > > > > > >   case KVM_CAP_XSAVE:
> > > > > > > + case KVM_CAP_DEVICE_MSIX_MASK:
> > > > > > >           r = 1;
> > > > > > >           break;
> > > > > > >   
> > > > > > >   case KVM_CAP_COALESCED_MMIO:
> > > > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > > > index 919ae53..eafafb1 100644
> > > > > > > --- a/include/linux/kvm.h
> > > > > > > +++ b/include/linux/kvm.h
> > > > > > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > > > > > 
> > > > > > >  #endif
> > > > > > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > > > > > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > > > > 
> > > > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > > > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > > > > > > +#endif
> > > > > > 
> > > > > > We seem to have these HAVE macros all over.
> > > > > > Avi, what's the idea? Let's drop them?
> > > > > 
> > > > > Um? This kind of marcos are used here to avoid vendor specific string
> > > > > in the header file.
> > > > 
> > > > Am I the only one that dislikes the ifdefs we have all over this code
> > > > then? Rest of linux tries to keep ifdefs local by stubbing
> > > > functionality out.
> > > > 
> > > > > > > + BUG_ON(adev->msix_mmio_base == 0);
> > > > > > 
> > > > > > Below I see
> > > > > > 
> > > > > >     adev->msix_mmio_base = msix_mmio->base_addr;
> > > > > > 
> > > > > > which comes from userspace. BUG_ON is an unfriendly way to
> > > > > > tell user about a bug in qemu.
> > > > > > 
> > > > > > Anyway, I don't think we should special-case 0 gpa.
> > > > > > It's up to the user where to base the table.
> > > > > > Use a separate flag to signal that table was initialized.
> > > > > 
> > > > > Base_addr can't be 0 in any case. Only the unimplemented BAR are
> > > > > hardwired to zero.
> > > > 
> > > > No, the PCI spec does not say so. Check it out: the unimplemented BAR
> > > > is hardwired to zero but implemented can be set to 0 as well.
> > > 
> > > OK, OK, OK, I would add one more flag.
> > > 
> > > > > Also if we get here with base_addr = 0, it is surely a bug.
> > > > > 
> > > > > > > +
> > > > > > > + return -EINVAL;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t
> > > > > > > addr, int len, +                    void *val)
> > > > > > > +{
> > > > > > > + struct kvm_assigned_dev_kernel *adev =
> > > > > > > +                 container_of(this, struct 
> > > > > > > kvm_assigned_dev_kernel,
> > > > > > > +                              msix_mmio_dev);
> > > > > > > + int idx, r = 0;
> > > > > > > + u32 entry[4];
> > > > > > > + struct kvm_kernel_irq_routing_entry e;
> > > > > > > +
> > > > > > > + mutex_lock(&adev->kvm->lock);
> > > > > > > + if (!msix_mmio_in_range(adev, addr, len)) {
> > > > > > > +         r = -EOPNOTSUPP;
> > > > > > > +         goto out;
> > > > > > > + }
> > > > > > > + if ((addr & 0x3) || len != 4)
> > > > > > > +         goto out;
> > > > > > > +
> > > > > > > + idx = msix_get_enabled_idx(adev, addr, len);
> > > > > > > + if (idx < 0) {
> > > > > > > +         idx = (addr - adev->msix_mmio_base) / 
> > > > > > > PCI_MSIX_ENTRY_SIZE;
> > > > > > > +         if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > > > > +                         PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > > > > > +                 *(unsigned long *)val =
> > > > > > > +                         test_bit(idx, adev->msix_mask_bitmap) ?
> > > > > > > +                         PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > > > > > 
> > > > > > I suspect this is wrong for big endian platforms: PCI returns
> > > > > > all values in little endian format.
> > > > > 
> > > > > Let big endian platform take care of it later... There is only x86
> > > > > has __KVM_HAVE_MSIX now.
> > > > 
> > > > Well it's easier to make the code correct when you write it than later.
> > > 
> > > Go back to me if someone want to implement MSI device assignment on
> > > big-endian machine.
> > 
> > I think it's more elegant to make the code correct the first time around.
> > What are we arguing about? A couple of cpu_to_le macros?
> 
> Sorry, I am really not familiar with big endian machine and related macros. 
> Also 
> lacks of way to test it. Would give it a try.

Actually, there is a way to do some limited check on a nx86 pc:
just tag the entry appropriately:

__le32 entry[4];

(Hmm, isn't there some macro instead of 4?)

And then make sparse not complain about code (make C=1).

> > 
> > > > > > > +         else
> > > > > > > +                 r = -EOPNOTSUPP;
> > > > > > > +         goto out;
> > > > > > > + }
> > > > > > > +
> > > > > > > + r = kvm_get_irq_routing_entry(adev->kvm,
> > > > > > > +                 adev->guest_msix_entries[idx].vector, &e);
> > > > > > > + if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > > > > > +         printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> > > > > > > +                 "idx %d, addr 0x%llx, len %d\n", idx, addr, 
> > > > > > > len);
> > > > > > 
> > > > > > Can a malicious userspace trigger this intentionally?
> > > > > 
> > > > > No.
> > > > 
> > > > Here's a scenario: bind an MSIX routing entry to device, then later
> > > > change its type to regular interrupt. Fills the log with warnings,
> > > > doesn't it?
> > > 
> > > Does QEmu suppose to discard the old routing?
> > 
> > It does not matter. We can't trust userspace to do the right thing.
> 
> OK, let's eliminate this.


Yea. debug or something. Also,
since it's a read, we can pass it up to userspace
which will maybe make it easier to debug.

> --
> regards
> Yang, Sheng
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to